Document processing

Transliteration of Arabic document images based on an optical shape recognition (OSR) technique

We are working on a complete processing and transliteration system for Arabic manuscripts. The system operates on sub-words, that is, connected components (CCs), which is why we call it an Optical Shape Recognition (OSR) system. Arabic has approximately 100,000 sub-words, so the system faces a 100,000-class problem, which is very difficult, if not impossible, to solve directly. Instead of tackling the original problem, we use a modified version of the binary problem methodology [1].
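To make the notion of a connected component concrete, the following is a minimal sketch of how CCs could be extracted from a binarized page. It is an illustration only, not the system described above: the function names `label_components` and `bounding_box`, the choice of 8-connectivity, and the toy `page` are all assumptions, and the binary problem methodology of [1] is not shown.

```python
from collections import deque


def label_components(image):
    """Return the 8-connected foreground regions (pixels equal to 1).

    Each component is a list of (row, col) tuples. Hypothetical helper,
    not taken from the OSR system itself.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component, inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


# Toy binary "page": 1 is ink, 0 is background (assumed convention).
page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 1],
]
shapes = label_components(page)
boxes = [bounding_box(s) for s in shapes]
```

Each bounding box could then be cropped and passed to a shape classifier; how that classifier is built is outside the scope of this sketch.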

Arabic word recognition

A vast collection of Arabic historical documents has been digitized by national archives and museums. Despite being in digital form, these documents remain difficult to study, because document images do not support high-level features such as word search. Deciphering old manuscripts is a difficult task because of degradations that may alter the document content, the use of calligraphic styles that are no longer employed, and the large variability of handwriting.

OCR Correction

Research on correcting the OCR output of scanned document images has been conducted in the field of document image analysis and processing over the past years, because large numbers of printed and handwritten documents are available only in image format. Dealing effectively with these documents requires correctly recognizing their contents after they have been scanned. Although OCR recognition rates are claimed to be higher than 90%, there is no doubt that the rate is much lower for non-Latin languages such as Arabic, owing to their particularities and complex morphology.
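One common family of post-OCR correction techniques replaces an unrecognized token with the closest word in a lexicon. The sketch below illustrates that idea with the Levenshtein edit distance. It is an assumption made for illustration, not the correction method used in this project: `edit_distance`, `correct_token`, the `max_distance` threshold, and the tiny Latin-script lexicon are all hypothetical. The functions work on any Unicode strings, Arabic included.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_token(token, lexicon, max_distance=2):
    """Return the nearest lexicon word, or the token itself if none is close.

    `lexicon` is a list of known words; `max_distance` is an assumed cutoff
    that stops the corrector from rewriting tokens it cannot explain.
    """
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda word: edit_distance(token, word))
    if edit_distance(token, best) <= max_distance:
        return best
    return token


# Hypothetical lexicon and a typical OCR confusion ("rn" read for "m").
lexicon = ["document", "image", "recognition"]
fixed = correct_token("docurnent", lexicon)
unchanged = correct_token("xyz", lexicon)
```

A practical corrector would also weigh word frequency and context, and for Arabic it would have to account for the morphology mentioned above, which a flat lexicon lookup does not capture.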

Degradation modeling

"Knowing the problem is half way to meeting the solution."

Degradation modeling is a key step in the development of restoration and enhancement methods for document images. For document images that suffer from imaging degradation, there are many well-developed models that can be used to generate datasets of text in the form of a single character, a single word, or even a whole page.
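As a rough illustration of how a degradation model can turn a clean image into synthetic training data, the sketch below flips each pixel of a binary image independently with a fixed probability. This is a deliberately simplified noise model chosen for brevity, not one of the established models referred to above; the function name `degrade`, the `flip_prob` parameter, and the toy glyph are assumptions.

```python
import random


def degrade(image, flip_prob, seed=0):
    """Flip each binary pixel independently with probability `flip_prob`.

    A seeded generator makes the synthetic degradation reproducible.
    Simplified, assumed model: real degradation models are usually
    spatially dependent (for example, stronger near stroke edges).
    """
    rng = random.Random(seed)
    return [
        [1 - pixel if rng.random() < flip_prob else pixel for pixel in row]
        for row in image
    ]


# Toy clean glyph: 1 is ink, 0 is background (assumed convention).
clean = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

# Several degraded variants of the same glyph, one per seed.
dataset = [degrade(clean, flip_prob=0.1, seed=s) for s in range(5)]
```

Varying the seed and the flip probability yields as many degraded copies of a character, word, or page as needed.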

Diffusion-based enhancement methods

Diffusion-based and PDE-based methods are powerful tools in image processing. These methods are extremely local and are therefore well suited to problems with a high degree of spatial correlation. One example is document image processing, where many of the physical phenomena that affect paper and ink can be described with simple diffusion-based models. However, direct application of diffusion-based methods to degraded historical document images is not very successful.
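The locality of diffusion-based methods can be seen in the simplest case, linear isotropic diffusion (the heat equation), where each pixel is updated from its four immediate neighbours only. The sketch below is a generic textbook scheme under stated assumptions (explicit time stepping, replicated borders, a step size `dt` of at most 0.25 for stability); it is not the enhancement method developed in this project, and the name `diffuse` is hypothetical.

```python
def diffuse(image, steps=1, dt=0.2):
    """Apply `steps` explicit iterations of the 2-D heat equation.

    Each pixel moves toward the mean of its 4 neighbours. Borders are
    handled by replicating the edge pixel (zero-flux boundary). The
    explicit scheme is stable for dt <= 0.25.
    """
    rows, cols = len(image), len(image[0])
    current = [[float(v) for v in row] for row in image]
    for _ in range(steps):
        nxt = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                centre = current[r][c]
                up = current[max(r - 1, 0)][c]
                down = current[min(r + 1, rows - 1)][c]
                left = current[r][max(c - 1, 0)]
                right = current[r][min(c + 1, cols - 1)]
                # Discrete Laplacian: sum of neighbours minus 4x the centre.
                laplacian = up + down + left + right - 4.0 * centre
                nxt[r][c] = centre + dt * laplacian
        current = nxt
    return current


# A single bright pixel spreads to its direct neighbours after one step.
spike = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
smoothed = diffuse(spike, steps=1, dt=0.2)
```

Because this scheme smooths ink and background alike, it blurs stroke edges as well as noise, which is one intuition for why a direct application to degraded documents tends to fall short.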

Multispectral imaging (MSI)

Multispectral imaging (MSI) has been used in many scientific and industrial applications, such as space exploration, remote sensing, and medical diagnosis. One interesting application of MSI is the study and preservation of cultural heritage, including artworks and ancient manuscripts. The advantage of MSI is that it captures data at different wavelengths, including those outside the visible spectrum, and can thus reveal information that is hidden to the human eye. MSI is very useful for extracting text from historical manuscripts and palimpsests. MS images also provide more legible views of faded manuscripts.
