OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters
Digitizing historical documents is crucial for preserving literary heritage. With the availability of low-cost capturing devices, libraries and institutes all over the world have preserved old literature in the form of scanned documents. However, working with these scanned images remains tedious because their text cannot be searched. Contemporary machine learning approaches have been applied successfully to recognize text in both printed and handwritten form; however, these approaches require a large amount of transcribed training data to obtain satisfactory performance. Transcribing documents manually is a laborious and costly task, requiring many man-hours and language-specific expertise. This paper presents a generic iterative training framework to address this issue. The proposed framework is applicable not only to historical documents but also to present-day documents for which manually transcribed training data is unavailable. Starting with the minimal information available, the proposed approach iteratively corrects the training and generalization errors. Specifically, we first train a segmentation-based OCR method on individual symbols and then use its semi-corrected recognized text lines as ground-truth data for segmentation-free sequence learning, which learns to correct the errors in the ground truth by incorporating context-aware processing. The proposed approach is applied to a collection of 15th-century Latin documents. The iterative procedure using segmentation-free OCR reduced the initial character error rate of about 23% (obtained from the segmentation-based OCR) to less than 7% in a few iterations.
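The iterative procedure and the error measure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `iterative_training`, `train_fn`, and `recognize_fn` are hypothetical placeholders for a segmentation-free sequence learner and its recognizer. The character error rate is assumed here to be the Levenshtein edit distance divided by the number of reference characters, which is a common convention but is not spelled out in the text.

```python
from typing import Callable, List, Sequence, TypeVar

Line = TypeVar("Line")    # a text-line image, in whatever form the recognizer accepts
Model = TypeVar("Model")  # a trained recognizer (hypothetical type)


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,               # delete r from the reference
                curr[j - 1] + 1,           # insert h into the reference
                prev[j - 1] + int(r != h)  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Assumed definition: total edit distance / total reference characters."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference transcripts are empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


def iterative_training(
    lines: Sequence[Line],
    seed_transcripts: Sequence[str],
    train_fn: Callable[[Sequence[Line], List[str]], Model],
    recognize_fn: Callable[[Model, Line], str],
    n_iterations: int,
) -> List[List[str]]:
    """Retrain on the previous round's output, starting from noisy seed text.

    seed_transcripts stands in for the output of the segmentation-based OCR;
    train_fn and recognize_fn are hypothetical hooks for the sequence learner.
    Returns the transcripts after each round, seed included.
    """
    transcripts = list(seed_transcripts)
    history = [transcripts]
    for _ in range(n_iterations):
        model = train_fn(lines, transcripts)  # current text serves as ground truth
        transcripts = [recognize_fn(model, line) for line in lines]
        history.append(transcripts)
    return history
```

If a small manually transcribed sample is available, applying `character_error_rate` to each entry of the returned history gives one way to track how the error changes from one iteration to the next.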