Abstract:

In this paper, we propose a novel text block identification method for ancient document understanding. Unlike traditional top-down and bottom-up approaches, our method is based on supervised learning on patches of document images. It can be regarded as an intermediate-level method that integrates the essential advantages of both the top-down and the bottom-up strategies. In our method, the document images are first partitioned into small patches, and positive and negative patches are then selected to form an active training set. Gabor features are extracted from each patch, and multilinear discriminant analysis (MDA) is employed to reduce the dimensionality of the data. To handle unseen documents, a random forest classifier is trained on the reduced representations of the patches. Compared to traditional approaches, our method not only captures the local texture features of each patch but also preserves the global information of the training images. Furthermore, MDA is guaranteed to learn a low-dimensional tensor subspace, which mitigates the curse of dimensionality. Moreover, the random forest classifier automatically selects useful features and delivers satisfactory identification results. Extensive experiments on ancient document images in several scripts demonstrate the effectiveness of our method.
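As an illustration of the first two stages described above (patch partitioning and Gabor feature extraction), the following is a minimal sketch and not the implementation used in the paper. The patch size, the filter bank parameters, the use of mean and standard deviation as response statistics, and all function names are assumptions made for the example. The MDA and random forest stages are omitted.

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split a 2-D grayscale image into non-overlapping square patches."""
    h, w = image.shape
    rows, cols = h // patch_size, w // patch_size
    # Crop so that the image divides evenly into patches.
    cropped = image[:rows * patch_size, :cols * patch_size]
    grid = cropped.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)
    return grid.reshape(rows * cols, patch_size, patch_size)

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a Gabor kernel: Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + y_rot ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength)
    return envelope * carrier

def gabor_features(patch, kernels):
    """Mean and standard deviation of each filter response.

    Filtering is done as a circular convolution via the FFT.
    """
    f_patch = np.fft.fft2(patch)
    feats = []
    for kernel in kernels:
        f_kernel = np.fft.fft2(kernel, s=patch.shape)
        response = np.real(np.fft.ifft2(f_patch * f_kernel))
        feats.extend([response.mean(), response.std()])
    return np.array(feats)

# Hypothetical settings: 16x16 patches, 4 orientations, 2 wavelengths.
PATCH_SIZE = 16
kernels = [
    gabor_kernel(size=9, wavelength=wl, theta=th, sigma=3.0)
    for wl in (4.0, 8.0)
    for th in np.linspace(0.0, np.pi, 4, endpoint=False)
]

# A random array stands in for a scanned document page.
rng = np.random.default_rng(0)
image = rng.random((64, 48))

patches = extract_patches(image, PATCH_SIZE)
features = np.stack([gabor_features(p, kernels) for p in patches])
```

Each row of `features` is the texture descriptor of one patch. In the pipeline described above, these descriptors would next be reduced by MDA and passed to the random forest classifier.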