IMPROVED OCR QUALITY FOR SMART SCANNED DOCUMENT MANAGEMENT SYSTEM

Authors

  • Viet Anh Pham, Le Quy Don Technical University
  • Duy Tung Khanh Nguyen, Le Quy Don Technical University
  • Manh Dat Tran, Le Quy Don Technical University
  • Van Dan Pham, Le Quy Don Technical University

DOI:

https://doi.org/10.56651/lqdtu.jst.v9.n01.60.ict

Keywords:

Optical Character Recognition (OCR), Table Recognition, Image Deskewing, Document Layout Analysis

Abstract

The quality of document images is a crucial factor in the performance of an Optical Character Recognition (OCR) model. Several properties of the input data hinder recognition, such as heterogeneous layouts, skew, and proportional fonts. This paper investigates several pre-processing algorithms, including image deskewing, table analysis, and document layout analysis, to improve the accuracy of the OCR model, and then builds an end-to-end scanned document management system. We verified the algorithms using Tesseract, a well-known OCR software package. Experiments on a real dataset showed that our methods can accurately process document images with arbitrary rotation angles and different layouts. As a result, the word-level accuracy of Tesseract can improve by 23% for documents with complex structures. The quality of the output text makes it possible to build a system that stores and searches documents efficiently.
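The abstract reports accuracy "by words" but does not define the metric. The sketch below shows one common interpretation, assumed here for illustration only: one minus the word-level edit distance between the OCR output and a ground-truth transcript, divided by the number of ground-truth words. The function names `word_edit_distance` and `word_accuracy` are hypothetical and do not come from the paper.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    # previous[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_accuracy(reference_text, ocr_text):
    """Word-level accuracy in [0, 1], assuming whitespace tokenization."""
    reference = reference_text.split()
    hypothesis = ocr_text.split()
    if not reference:
        # Assumption: an empty reference is matched only by empty output.
        return 1.0 if not hypothesis else 0.0
    errors = word_edit_distance(reference, hypothesis)
    # Clamp at zero because many inserted words can push errors past len(reference).
    return max(0.0, 1.0 - errors / len(reference))
```

Under this definition, an OCR result that misreads one word out of a four-word reference scores 0.75. The paper may instead use a different tokenization or matching rule, so the 23% figure should not be recomputed from this sketch.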

Published

2020-05-14

Section

Articles