The article presents a method for automatically forming training data sets for machine learning algorithms that classify electronic documents. Unlike known methods, it forms the training data sets by combining clustering and data augmentation methods that rely on calculating distances between objects in multidimensional spaces.
Keywords: supervised learning, clustering, pattern recognition, machine learning algorithm, electronic document, vectorization, formalized documents
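The combination of clustering and distance-based augmentation described in the abstract above can be illustrated with a minimal sketch. It assumes documents are already vectorized, uses plain k-means with Euclidean distance as the clustering step, and creates synthetic samples by interpolating each vector toward its nearest neighbor in the same cluster. All of these choices, and the function names `kmeans` and `augment`, are assumptions for illustration; the abstract does not specify the article's actual algorithms.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_clusters(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means (assumed clustering step); returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign_clusters(points, centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign_clusters(points, centroids), centroids


def augment(points, labels, per_point=1, seed=0):
    """Interpolate each point toward its nearest same-cluster neighbor.

    Returns a list of (synthetic_vector, cluster_label) pairs.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, (p, label) in enumerate(zip(points, labels)):
        peers = [q for j, (q, l) in enumerate(zip(points, labels))
                 if l == label and j != i]
        if not peers:
            continue  # a singleton cluster has nothing to interpolate with
        nearest = min(peers, key=lambda q: euclidean(p, q))
        for _ in range(per_point):
            t = rng.random()  # position on the segment between p and nearest
            synthetic.append(([a + t * (b - a) for a, b in zip(p, nearest)], label))
    return synthetic


# Hypothetical 2-D "document vectors" forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
extra = augment(points, labels)
```

Because each synthetic vector lies on the segment between two members of the same cluster, it inherits that cluster's label, so the enlarged set can serve as labeled training data without manual annotation.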
The article presents a methodology for forming machine learning algorithms, and determining their parameters, for classifying electronic documents by the importance of their information to an organization's officials. Unlike known methodologies, it forms the structure and number of machine learning algorithms dynamically: the set of the organization's structural divisions, and the sets of keywords reflecting the tasks and functions of those divisions, are determined automatically by analyzing the organization's Regulations and the Regulations of its structural divisions, using the theory of pattern recognition.
Keywords: lemmatization, pattern recognition, machine learning algorithm, electronic document, vectorization, formalized documents
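The idea of deriving the number of classifiers from the organization's own regulations can be sketched as follows. This is a rough illustration under stated assumptions: the regulation texts and division names are invented, lowercasing with a small stopword list stands in for lemmatization, and a simple keyword-overlap scorer stands in for a trained machine learning algorithm. The helper names `extract_keywords`, `build_scorers`, and `route` are hypothetical.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real system would use a fuller one.
STOPWORDS = {"the", "of", "and", "for", "to", "in", "a", "is", "on"}


def tokenize(text):
    """Lowercase word tokens; a stand-in for real lemmatization."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def extract_keywords(regulations, top_n=5):
    """Map each division to the top_n most frequent tokens of its regulations."""
    return {division: {w for w, _ in Counter(tokenize(text)).most_common(top_n)}
            for division, text in regulations.items()}


def build_scorers(keywords):
    """Create one scorer per division, so their number follows the data."""
    def make(kw):
        return lambda doc: len(kw & set(tokenize(doc)))
    return {division: make(kw) for division, kw in keywords.items()}


def route(document, scorers):
    """Return the division whose keywords overlap the document most."""
    return max(scorers, key=lambda d: scorers[d](document))


# Hypothetical regulation excerpts for two structural divisions.
regulations = {
    "finance": "budget planning, budget accounting and payroll",
    "it": "network administration, server maintenance and software support",
}
scorers = build_scorers(extract_keywords(regulations))
target = route("Request to repair the server and network", scorers)
```

Adding a third division to `regulations` would yield a third scorer with no code changes, which mirrors the dynamic formation of the structure and number of algorithms that the abstract describes.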