Abstract

Automatic recognition of Arabic words is a challenging task, and its complexity increases as the lexicon grows. In pre-modern documents the vocabulary is unconstrained, so a lexicon-reduction strategy is needed to lower the computational cost of recognition. This paper proposes a novel lexicon-reduction method for Arabic subwords based on the topology and geometry of their shapes. First, topological and geometrical information about a subword shape is extracted from its skeleton and encoded into a graph. The graph is then converted into a topological signature vector (TSV) that preserves the graph structure. The lexicon is reduced by computing the TSV distance between each lexicon subword shape and a query shape and keeping the i nearest subwords. The value of i is selected to meet a predetermined lexicon-reduction accuracy. The proposed framework has been tested on a database of pre-modern Arabic subword shapes, with promising results.
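
The reduction step described above can be illustrated with a minimal sketch. It assumes that TSVs have already been computed and are available as equal-length lists of numbers, and it uses the Euclidean distance as a stand-in for the TSV distance, which the abstract does not specify. The function names (tsv_distance, reduce_lexicon, select_i), the use of a labelled validation set to choose i, and the toy data are all hypothetical and serve only as illustration.

```python
import math

def tsv_distance(a, b):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def reduce_lexicon(query, lexicon, i):
    """Return the labels of the i lexicon entries whose TSVs are nearest to the query.

    lexicon is a list of (label, tsv) pairs.
    """
    ranked = sorted(lexicon, key=lambda entry: tsv_distance(query, entry[1]))
    return [label for label, _ in ranked[:i]]

def select_i(validation, lexicon, target_accuracy):
    """Smallest i for which the true label survives reduction for at least
    target_accuracy of the validation queries.

    validation is a list of (true_label, tsv) pairs; every true label is
    assumed to be present in the lexicon.
    """
    ranks = []
    for true_label, query in validation:
        ranked = reduce_lexicon(query, lexicon, len(lexicon))
        ranks.append(ranked.index(true_label) + 1)  # 1-based rank of the true label
    for i in range(1, len(lexicon) + 1):
        hits = sum(1 for r in ranks if r <= i)
        if hits / len(ranks) >= target_accuracy:
            return i
    return len(lexicon)

# Toy data: labels and 2-D vectors are invented for illustration only.
lexicon = [("a", [0.0, 0.0]), ("b", [1.0, 0.0]), ("c", [5.0, 5.0]), ("d", [0.0, 2.0])]
validation = [("a", [0.1, 0.0]), ("b", [0.4, 0.0]), ("d", [0.0, 1.2])]

i = select_i(validation, lexicon, target_accuracy=1.0)
reduced = reduce_lexicon([0.1, 0.0], lexicon, i)
```

In this sketch, a larger target accuracy forces a larger i and therefore a weaker reduction, which is the trade-off that the choice of i controls.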