
Binarization-free OCR for historical documents using LSTM networks

Mohammad Reza Yousefi, Mohammad Reza Soheili, Thomas Breuel, Ehsanollah Kabir, Didier Stricker

13th International Conference on Document Analysis and Recognition (ICDAR 2015), August 23-26, Nancy, France, Pages: 1-5, IEEE, 2015
A primary preprocessing block of almost any typical OCR system is binarization, which is intended to remove unwanted parts of the input image and keep only a binarized, cleaned-up version for further processing. The binarization step does not, however, always perform perfectly, and binarization artifacts can cause significant information loss, for instance by breaking or deforming character shapes. In historical documents, due to the more dominant presence of noise and other sources of degradation, the performance of binarization methods usually deteriorates; as a result, such preprocessing hinders the recognition pipeline. In this paper, we propose to skip the binarization step by directly training a 1D Long Short-Term Memory (LSTM) network on gray-level text lines. We collect a large set of historical Fraktur documents from publicly available online sources and form training and test sets for experiments on both gray-level and binarized text lines. To observe the impact of resolution, the experiments are carried out on two identical sets at low and high resolutions. Overall, using gray-level text lines, the 1D LSTM network reaches 25% and 12.5% lower error rates on the low- and high-resolution sets, respectively, compared to using binarization in the recognition pipeline.
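The core idea of the paper — feeding raw gray-level pixel columns of a text line to a 1D LSTM, one column per time step, so no binarization is required — can be sketched as follows. This is a minimal NumPy illustration with random weights; the line dimensions, hidden size, and parameterization are illustrative assumptions, not the paper's actual network configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: x is the current input column, h/c the previous
    hidden and cell state. W, U, b hold the stacked parameters of the
    input, forget, candidate, and output gates."""
    z = W @ x + U @ h + b
    n = h.size
    i = sigmoid(z[:n])           # input gate
    f = sigmoid(z[n:2 * n])      # forget gate
    g = np.tanh(z[2 * n:3 * n])  # candidate cell state
    o = sigmoid(z[3 * n:])       # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# A gray-level text line of height 32 and width 200, scanned column by
# column: each vertical pixel column is one time step of the 1D LSTM.
rng = np.random.default_rng(0)
height, width, hidden = 32, 200, 64
line = rng.random((height, width))  # stand-in for a normalized gray-level line

W = rng.standard_normal((4 * hidden, height)) * 0.1
U = rng.standard_normal((4 * hidden, hidden)) * 0.1
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
outputs = []
for t in range(width):
    h, c = lstm_step(line[:, t], h, c, W, U, b)
    outputs.append(h)

outputs = np.stack(outputs)  # one hidden vector per pixel column
print(outputs.shape)         # (200, 64)
```

In a full recognizer, the per-column hidden states would feed a softmax layer trained with CTC to emit character sequences; the point of the sketch is only that the input is the unthresholded gray-level image itself.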

BibTeX:

@inproceedings{yousefi2015binarization,
       author    = {Mohammad Reza Yousefi and Mohammad Reza Soheili and Thomas Breuel and Ehsanollah Kabir and Didier Stricker},
       title     = {Binarization-free OCR for historical documents using LSTM networks},
       booktitle = {13th International Conference on Document Analysis and Recognition (ICDAR)},
       address   = {Nancy, France},
       pages     = {1--5},
       month     = aug,
       year      = {2015},
       publisher = {IEEE},
       abstract  = {A primary preprocessing block of almost any typical
OCR system is binarization, which is intended to remove unwanted parts
of the input image and keep only a binarized, cleaned-up version for
further processing. The binarization step does not, however, always
perform perfectly, and binarization artifacts can cause significant
information loss, for instance by breaking or deforming character
shapes. In historical documents, due to the more dominant presence of
noise and other sources of degradation, the performance of binarization
methods usually deteriorates; as a result, such preprocessing hinders
the recognition pipeline. In this paper, we propose to skip the
binarization step by directly training a 1D Long Short-Term Memory
(LSTM) network on gray-level text lines. We collect a large set of
historical Fraktur documents from publicly available online sources and
form training and test sets for experiments on both gray-level and
binarized text lines. To observe the impact of resolution, the
experiments are carried out on two identical sets at low and high
resolutions. Overall, using gray-level text lines, the 1D LSTM network
reaches 25% and 12.5% lower error rates on the low- and high-resolution
sets, respectively, compared to using binarization in the recognition
pipeline.},
       url       = {http://www.dfki.de/web/forschung/publikationen/renameFileForDownload?filename=Binarization-free OCR for historical documents using LSTM networks.pdf&file_id=uploads_3260}
}