OCR Correction

Research on correcting OCR output for scanned document images has been active in the field of document image analysis and processing for years, because large numbers of printed and handwritten documents are available only in image format. Dealing effectively with these documents requires recognizing their contents correctly after scanning. Although OCR recognition rates are claimed to exceed 90%, the rate is undoubtedly much lower for non-Latin languages such as Arabic, owing to their particularities and complex morphology. Moreover, most OCR engines focus on reporting lexical errors (LE), that is, misspelled words not found in the engine's dictionary, and propose a set of candidates for each low-confidence output, from which the user has to select. With Arabic OCR this list is too long and can reach twenty or more words, so we also aim to shorten it by eliminating the less probable words, if not to automate the correction process fully.

On the other hand, OCR engines sometimes do not assign low confidence to erroneous words that are found in the dictionary used for proofreading, even when these words are out of context. This is very frequent with Arabic words because of the characteristics of the language. When these words are suspected to be wrong, a large candidate list is proposed, and it sometimes does not include the correct word. We are especially interested in these errors because they account for a significant share of the OCR error rate and remain problematic: they have to be detected by manual proofreading, and even human readers sometimes miss them. These words are often the meaningful terms that serve as the document's index terms, so misrecognizing them leads to poor retrieval. We call these “semantic errors”.

Our work tackles three aspects. The first is to model general correction rules that target classes of characters rather than a single specific character, improving the concepts of error n-gram and alignment used to correct both lexical and semantic errors, as well as other types of errors that we will determine. The second is to use language models for Arabic word analysis to solve the problems of subwords, stop words, and agglutination. The third is to make the language model more semantic, which yields a more representative probability distribution because n-grams with the same semantics are grouped together, unlike the n-grams normally used in correction.
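As a rough illustration of the first aspect, the sketch below maps each character to a class label so that one rule covers a whole class of confusions. The grouping is an assumption made for this example (Arabic letters that share a base shape and differ only in their dots); it is not the meta-class inventory of this project, and the names `META_CLASSES`, `to_meta`, and `class_candidates` are hypothetical.

```python
# Hypothetical meta-classes: letters with the same base shape that differ
# only in dots. This grouping is illustrative, not the project's actual one.
META_CLASSES = {
    "B": "بتث",
    "J": "جحخ",
    "D": "دذ",
    "R": "رز",
    "S": "سش",
    "C": "صض",
    "T": "طظ",
    "E": "عغ",
}

# Reverse lookup: character -> meta-class label.
CHAR_TO_CLASS = {ch: label for label, chars in META_CLASSES.items() for ch in chars}


def to_meta(word):
    """Replace each character by its meta-class label; unmapped characters stay as-is."""
    return tuple(CHAR_TO_CLASS.get(ch, ch) for ch in word)


def class_candidates(ocr_word, lexicon):
    """Return lexicon words with the same meta-class signature as the OCR output."""
    key = to_meta(ocr_word)
    return [w for w in lexicon if w != ocr_word and to_meta(w) == key]
```

For instance, under this assumed grouping, `class_candidates("حبر", ["خبر", "جبر", "كتب"])` keeps the first two words and drops the third, because only they share the signature of the OCR output. A single class-level rule thus stands in for several character-level substitution rules.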

Our proposed system is shown in the following figure:

Some results and contributions:

  1. Improving and generalizing the concept of character n-gram by adding the concept of character meta-class
  2. Improving the language models by using topic corpora instead of the global one, and by using bidirectional n-grams and stop-word removal
  3. Correcting semantic errors
  4. Detecting isolated letters and agglutinated words using the language models
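One way to picture how a bidirectional n-gram model could shorten a candidate list is sketched below. It assumes that "bidirectional" means scoring a candidate against both its left and its right neighbour with add-one smoothed bigram estimates. The toy English corpus and the function names `train`, `score`, and `prune` are hypothetical, and the actual system may combine the two directions differently.

```python
from collections import Counter


def train(sentences):
    """Collect bigram counts plus how often each word occurs as a left or right neighbour."""
    bigrams, as_left, as_right, vocab = Counter(), Counter(), Counter(), set()
    for sentence in sentences:
        tokens = sentence.split()
        vocab.update(tokens)
        for left, right in zip(tokens, tokens[1:]):
            bigrams[(left, right)] += 1
            as_left[left] += 1
            as_right[right] += 1
    return {"bigrams": bigrams, "as_left": as_left, "as_right": as_right, "v": len(vocab)}


def score(model, prev, cand, nxt):
    """Product of smoothed P(cand | previous word) and P(cand | following word)."""
    b, v = model["bigrams"], model["v"]
    forward = (b[(prev, cand)] + 1) / (model["as_left"][prev] + v)
    backward = (b[(cand, nxt)] + 1) / (model["as_right"][nxt] + v)
    return forward * backward


def prune(model, prev, candidates, nxt, keep=2):
    """Keep only the `keep` candidates that fit the surrounding context best."""
    ranked = sorted(candidates, key=lambda c: score(model, prev, c, nxt), reverse=True)
    return ranked[:keep]


# Toy corpus standing in for a topic corpus (an assumption of this sketch).
corpus = [
    "the news was good",
    "the news was late",
    "the ink was dry",
    "good news was rare",
]
model = train(corpus)
shortlist = prune(model, "the", ["nets", "ink", "news"], "was", keep=2)
```

With this toy corpus, `shortlist` ranks "news" first and "ink" second and discards "nets", which never occurs in either context. Training on a topic corpus rather than a global one would only change the sentences passed to `train`.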