Automatic Word Ground Truth Generation for Camera Captured Documents
A database for camera captured documents is useful to train OCRs to obtain better performance. However, no dataset exists for camera captured documents because it is very laborious and costly to build these datasets manually. In this paper, a fully automatic approach allowing building the very large scale (i.e., millions of images) labeled camera captured documents dataset is proposed. The proposed approach does not require any human intervention in labeling. Evaluation of samples generated by the proposed approach shows that more than 97% of the images are correctly labeled. Novelty of the proposed approach lies in the use of document image retrieval for automatic labeling, especially for camera captured documents, which contain different distortions specific to camera, e.g., blur, perspective distortion, etc.