Tools for solving problems of recognition and clustering of data from documents using machine learning methods

Zolotarev O.V., Yurchak V.A.

Received: 09.01.2023

The article describes the capabilities and advantages of unsupervised machine learning systems and how they differ from template-based learning. It defines clustering and outlines its main methods and the tasks this technique solves. The algorithm for recognizing data from documents using OCR (optical character recognition) technology is described in detail, and the goals and objectives of using OCR in the business processes of IT companies are formulated. Tools are then presented for recognizing and clustering data from scanned PDF documents using the Nanonets and Tesseract machine learning libraries. The article concludes with the advantages and disadvantages of using these libraries to recognize and cluster data from document scans.

Keywords: machine learning, clustering, data recognition, Nanonets library, Tesseract library