Transliteration of Arabic document images based on an optical shape recognition (OSR) technique

We are working on a complete processing and transliteration system for Arabic manuscripts. The system operates on sub-words, that is, connected components (CCs), rather than on individual characters, so we call it an Optical Shape Recognition (OSR) system. Arabic has approximately 100,000 sub-words, so the system faces a 100,000-class problem, which is very difficult, if not impossible, to solve directly. Instead of working on the original problem, we use a modified version of the binary problem methodology [1]. In this methodology, a small set of overlapping binary problems replaces the original highly multiclass problem. For any new document, the feature vector of each sub-word is calculated from its skeleton and some a priori information, such as the average stroke width. The binary labels of each sub-word are then obtained using the trained machines and combined to generate candidate strings for that sub-word. The set of strings is pruned using a dictionary, and the final set of candidate strings for each sub-word is provided as the output of the system. The learning machines are support vector machines (SVMs), trained on two databases that we have developed for this purpose: IBN SINA and Avicenna.

[1] R. Farrahi Moghaddam, M. Cheriet, M. M. Adankon, K. Filonenko, and R. Wisnovsky. IBN SINA: a database for research on processing and understanding of Arabic manuscripts images. DAS’10, pages 11–18, Boston, Massachusetts, 2010.
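The last two steps of the pipeline, combining binary labels into candidate strings and pruning them with a dictionary, can be illustrated with a minimal Python sketch. It is not the system's implementation. It assumes, purely for illustration, that each binary problem answers "does this sub-word contain letter X?", that every detected letter occurs exactly once, and that letters are written in a Latin transliteration. The names `candidate_strings` and `prune` are hypothetical, and the binary labels are taken as given rather than produced by trained SVMs.

```python
from itertools import permutations


def candidate_strings(binary_labels):
    """Enumerate every ordering of the letters whose binary label is positive.

    binary_labels maps a letter to True/False, standing in for the output of
    one binary classifier per letter (an assumption of this sketch).
    """
    present = sorted(letter for letter, flag in binary_labels.items() if flag)
    # Each detected letter is assumed to appear exactly once in the sub-word.
    return {"".join(order) for order in permutations(present)}


def prune(candidates, dictionary):
    """Keep only the candidate strings that are valid dictionary sub-words."""
    return sorted(candidates & set(dictionary))


# Hypothetical labels for one sub-word: 'b' and 'n' detected, 'k' not.
labels = {"b": True, "n": True, "k": False}
candidates = candidate_strings(labels)
final = prune(candidates, {"bn", "kb", "ban"})
print(final)
```

In this toy case the two detected letters yield two orderings, and the dictionary keeps only the one that is a known sub-word. This shows why a handful of binary decisions plus a dictionary can stand in for a single classifier over roughly 100,000 classes.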
