IMPROVED OCR QUALITY FOR SMART SCANNED DOCUMENT MANAGEMENT SYSTEM
DOI:
https://doi.org/10.56651/lqdtu.jst.v9.n01.60.ict
Keywords:
Optical Character Recognition (OCR), Table Recognition, Image Deskewing, Document Layout Analysis
Abstract
The quality of document images is a crucial factor in the performance of an Optical Character Recognition (OCR) model. Various issues in the input data, such as heterogeneous layouts, skew, and proportional fonts, hinder recognition. This paper investigates several data pre-processing algorithms, including image deskewing and table and document layout analysis, to improve the accuracy of the OCR model, and then builds an end-to-end scanned document management system. We verified the algorithms using Tesseract, a well-known OCR software package. Experiments on a real dataset show that our methods can accurately process document images with arbitrary rotation angles and different layouts. As a result, the word-level accuracy of Tesseract can be improved by 23% for documents with complex structures. The quality of the output text makes it possible to build a system that stores and searches documents efficiently.