Document Processing and Understanding

Processing huge volumes of unprocessed handwritten and historical documents is a critical challenge facing many heritage and cultural institutions. Our main objective in document image processing and understanding is to develop and implement novel models and techniques that help in generating, enhancing, presenting and understanding handwritten document images. The direct involvement of scholars and researchers from various institutes and universities, such as McGill University, has allowed us to choose goal-oriented directions for our research and development. Currently, our focus is on providing a complete understanding system consisting of imaging, pre-processing/enhancement, word-spotting, transliteration and data mining units in a user-friendly, collaborative, virtual environment.

For imaging, we are considering multispectral infrared imaging as a way to produce almost-true virtual replacements for physical historical documents. In the preprocessing part, enhancement, image segmentation and binarization, and line and word segmentation are under extensive study. At the understanding level, we are working on both word spotting and transliteration using innovative segmentation-free methods that explore new directions in this field. A sample list of our work appears below. We also welcome collaboration at any level.

Here is an example output from one of our fully automatic text binarization algorithms:

Sample input and output images
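The binarization algorithm behind the images above is not described on this page. As a generic point of reference only, the sketch below shows what a basic binarization step does, using Otsu's global thresholding, a standard textbook method chosen here as an assumption. The function names `otsu_threshold` and `binarize` are illustrative, and the input is assumed to be an 8-bit grayscale image stored as a list of rows.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the dark class (assumed to be ink).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to a mask: 1 = ink, 0 = background."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]
```

A single global threshold of this kind tends to struggle on historical documents with stains, bleed-through or uneven illumination, which is one reason more adaptive methods are an active research topic.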

Articles in this category

Transliteration of Arabic document images based on an optical shape recognition (OSR) technique

We are working on a complete processing/transliteration system for Arabic manuscripts. Because the system is based on sub-words, i.e. connected components (CCs), we call it an Optical Shape Recognition (OSR) system. Arabic has approximately 100,000 sub-words. The system therefore faces a 100,000-class problem, which is very difficult, if not impossible, to solve directly. Instead of working on the original problem, we use a modified version of the binary problem methodology [1].

Arabic word recognition


A vast collection of Arabic historical documents has been digitized by national archives and museums. Despite being in digital form, these documents remain difficult to study, because document images do not support high-level features such as word search. Deciphering old manuscripts is a difficult task because of degradations that may alter the document content, the use of calligraphic styles that are no longer employed, and the large variability of handwriting.

OCR Correction

OCR schema

Research on correcting the OCR output of scanned document images has been conducted in the field of document image analysis and processing in recent years, because large numbers of printed and/or handwritten documents are available only in image format. Dealing effectively with these documents requires correctly recognizing their contents after they are scanned. Although OCR recognition rates are claimed to exceed 90%, the rate is undoubtedly much lower for non-Latin languages such as Arabic, owing to its particularities and complex morphology.

Degradation modeling

Defect model

"Knowing the problem is half way to meeting the solution."

Degradation modeling is a key step in the development of restoration and enhancement methods for document images. For document images that suffer from imaging degradation, there are many well-developed models that can be used to generate datasets of text in the form of a single character, a single word, or even a whole page.
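To make the idea of an imaging degradation model concrete, here is a minimal sketch in the spirit of local pixel-flip models such as Kanungo's, in which a pixel of a clean binary image is flipped with a probability that decays with its distance from the nearest stroke boundary. This is an assumption for illustration, not the model used in the article: the names `distance_to_opposite` and `degrade`, the parameters `a0`, `alpha` and `eta`, and the brute-force distance computation (practical only for small images) are all hypothetical choices, and the morphological post-processing of the full model is omitted.

```python
import math
import random


def distance_to_opposite(img):
    """For each pixel, the Euclidean distance to the nearest pixel of the other value."""
    h, w = len(img), len(img[0])
    points = {0: [], 1: []}
    for y in range(h):
        for x in range(w):
            points[img[y][x]].append((y, x))
    dist = [[math.inf] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            others = points[1 - img[y][x]]
            if others:
                dist[y][x] = min(math.hypot(y - oy, x - ox) for oy, ox in others)
    return dist


def degrade(img, a0=1.0, alpha=1.0, eta=0.0, seed=None):
    """Flip each pixel with probability a0 * exp(-alpha * d^2) + eta.

    img is a binary image (list of rows, 1 = ink); d is the distance to the
    nearest pixel of the opposite value, so pixels near edges flip most often.
    """
    rng = random.Random(seed)
    dist = distance_to_opposite(img)
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            p = a0 * math.exp(-alpha * dist[y][x] ** 2) + eta
            if rng.random() < p:
                out[y][x] = 1 - value
    return out
```

Applying `degrade` repeatedly to a clean rendering of a character, word or page with different seeds and parameter values is one way such a model can produce a synthetic dataset of degraded samples.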

Diffusion-based enhancement methods

Schema diffusion

Diffusion-based and PDE-based methods are powerful tools in image processing. These methods are extremely local, and are therefore well suited to problems with a high degree of spatial correlation, for example document image processing, where many of the physical phenomena that affect paper and ink can be described using simple diffusion-based models. However, direct application of diffusion-based methods to degraded historical document images is not very successful.
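As an illustration of what "extremely local" means here, the sketch below implements a classic textbook scheme, Perona-Malik anisotropic diffusion, in which each pixel is updated only from its four neighbours and the flow is suppressed across strong edges. It is a generic example chosen as an assumption, not the method developed in the article; the function name `perona_malik` and the parameters `kappa` (edge sensitivity) and `step` (time step, kept at or below 0.25 for stability of this explicit scheme) are illustrative.

```python
import math


def perona_malik(img, iterations=10, kappa=20.0, step=0.2):
    """Explicit Perona-Malik diffusion on a grayscale image (list of rows).

    Each update uses only the 4-neighbourhood of a pixel. The conduction
    coefficient exp(-(diff / kappa)^2) is close to 1 for small differences
    (noise gets smoothed) and close to 0 for large ones (edges are kept).
    """
    h, w = len(img), len(img[0])
    u = [[float(v) for v in row] for row in img]
    for _ in range(iterations):
        nxt = [row[:] for row in u]
        for y in range(h):
            for x in range(w):
                flow = 0.0
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    # Pixels outside the image contribute no flow.
                    if 0 <= ny < h and 0 <= nx < w:
                        diff = u[ny][nx] - u[y][x]
                        flow += math.exp(-(diff / kappa) ** 2) * diff
                nxt[y][x] = u[y][x] + step * flow
        u = nxt
    return u
```

With a small `kappa`, a sharp ink/paper transition is left almost untouched while low-amplitude noise is averaged out. The same locality also means the scheme has no notion of what is text and what is a stain, which hints at why applying it directly to heavily degraded documents is not enough.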

Multispectral imaging (MSI)

MSI Schema

Multispectral imaging (MSI) has been used in many scientific and industrial applications such as space exploration, remote sensing and medical diagnosis. One interesting application of MSI is the study and preservation of cultural heritage, including artworks and ancient manuscripts. The advantage of MSI is that it can analyze data at different wavelengths, including those outside the visible spectrum, revealing information that may be hidden to the human eye. MSI is very useful for extracting text from historical manuscripts and palimpsests. MS images also provide more legible views of faded manuscripts.
