Transliteration of Arabic document images based on an optical shape recognition (OSR) technique

We are working on a complete processing and transliteration system for Arabic manuscripts. The system operates on sub-words, i.e. connected components (CCs), rather than on individual characters, which is why we call it an Optical Shape Recognition (OSR) system. Arabic has approximately 100,000 sub-words, so the system faces a 100,000-class problem, which is very difficult, if not impossible, to solve directly. Instead of working on the original problem, we use a modified version of the binary problem methodology [1]. In this methodology, a small set of overlapping binary problems replaces the original highly multiclass problem. For any new document, the feature vector of each sub-word is calculated from its skeleton and some a priori information, such as the average stroke width. The binary labels of each sub-word are then obtained from the trained machines and combined to generate candidate strings for that sub-word. The set of strings is pruned using a dictionary, and the final set of candidate strings for each sub-word is provided as the output of the system. The learning machines are support vector machines (SVMs), trained on two databases that we have developed for this purpose: IBN SINA and Avicenna.

[1] R. Farrahi Moghaddam, M. Cheriet, M. M. Adankon, K. Filonenko, and R. Wisnovsky. IBN SINA: a database for research on processing and understanding of Arabic manuscripts images. DAS’10, pages 11–18, Boston, Massachusetts, 2010.
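
The last two stages described above (combining binary labels into candidate strings, then pruning with a dictionary) can be illustrated with a minimal Python sketch. It is not the system's actual implementation. It assumes, purely for illustration, that each binary problem answers "does this sub-word contain letter X?" and that letters are written in a Latin transliteration. The function names `candidate_strings` and `prune_with_dictionary`, the toy labels, and the toy lexicon are all hypothetical.

```python
from itertools import permutations


def candidate_strings(binary_labels):
    """Combine per-letter binary decisions into candidate strings.

    binary_labels maps a letter to True/False. Each entry stands for the
    output of one trained binary classifier (assumed here to answer
    "does the sub-word contain this letter?").
    """
    present = sorted(letter for letter, is_on in binary_labels.items() if is_on)
    # The binary labels say which letters occur, not their order, so every
    # ordering of the detected letters is kept as a candidate.
    return {"".join(order) for order in permutations(present)}


def prune_with_dictionary(candidates, dictionary):
    """Keep only the candidates that are known sub-words."""
    return sorted(candidates & set(dictionary))


# Toy example: the classifiers report that 'a', 'b' and 't' are present.
labels = {"a": True, "b": True, "k": False, "n": False, "t": True}
lexicon = {"bat", "tab", "ban", "kat"}

result = prune_with_dictionary(candidate_strings(labels), lexicon)
print(result)  # ['bat', 'tab']
```

In this toy setting the six orderings of the detected letters are reduced to the two that appear in the lexicon. The sketch ignores repeated letters and any non-letter binary problems, which a real set of overlapping binary problems would need to handle.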
