Machine learning for document understanding

To provide training data for optical shape recognition (OSR), the following databases of different sizes have been created in collaboration with Prof. Robert Wisnovsky (Institute of Islamic Studies, McGill University):

1. IBN SINA database: A database of 22720 shapes.

2. Avicenna database: A database of 123,007 labeled and unlabeled shapes.

3. IBN SINA Ext database (with images): A database of more than 22720 shapes with their color images.

These databases have been used in two learning challenges. Details of the databases and the challenges are given below.

Online handwritten gestures for interaction

1. The SIGN On-Line Database


PERSIAN HERITAGE IMAGE BINARIZATION DATASET (PHIBD 2012)

PHIBD is the first ground-truthed Persian Heritage Image Binarization Dataset, developed using an efficient ground-truthing tool called “PhaseGT” [1]. PHIBD 2012 contains 15 historical document images with their corresponding ground-truth binary images. The historical images in the dataset suffer from various types of degradation. The dataset is also divided into training and testing subsets for binarization methods that use learning approaches. For more information, please visit the IAPR-TC11 website: http://www.iapr-tc11.org/mediawiki/index.php/Datasets_List
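As an illustration of how ground-truth binary images of this kind are typically used, the sketch below scores a binarization result against its ground truth with a pixel-level F-measure. This is a commonly used metric, not necessarily the one used with PHIBD; the function name `binarization_f_measure`, the nested-list image representation, and the convention that ink pixels have value 0 are all assumptions made for the example.

```python
def binarization_f_measure(predicted, ground_truth, ink=0):
    """Pixel-level F-measure of a binarized image against its ground truth.

    Both images are nested lists of equal shape (hypothetical representation);
    pixels equal to `ink` are treated as text, everything else as background.
    """
    tp = fp = fn = 0
    for pred_row, gt_row in zip(predicted, ground_truth):
        for p, g in zip(pred_row, gt_row):
            if p == ink and g == ink:
                tp += 1          # text pixel correctly detected
            elif p == ink:
                fp += 1          # background wrongly marked as text
            elif g == ink:
                fn += 1          # text pixel missed
    if tp == 0:
        return 0.0               # no correct text pixels at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Tiny hand-made example: 0 = ink, 255 = background.
ground_truth = [[0, 0, 255],
                [255, 0, 255]]
predicted = [[0, 255, 255],
             [255, 0, 0]]
print(binarization_f_measure(predicted, ground_truth))
```

For a learning-based method, a score like this would be computed on the testing subset only, after fitting on the training subset.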

THE SIGN ON-LINE DATABASE

Overview
The Synchromedia-Imadoc Gesture New On-Line Database (SIGN-OnDB) contains on-line handwritten gesture data. It can be used to train and test gesture recognition systems, for example in applications that associate specific gestures with editing functions (such as copying digital ink elements).
The data was acquired on Tablet PCs and whiteboards.

IBN SINA EXT DATABASE (WITH SUB-WORD IMAGES)

The IBN SINA Ext database is an extension of the IBN SINA database, which was published earlier as part of the Active Learning Challenge 2010 and reported at DAS’10. The extended database provides more data, including images of the sub-words. Please see the guide for more details.

Direct link to database:

1. Binarized images only (smaller file, 17 MB): download

IBN SINA DATABASE

Database Name: IBN SINA

Manuscript Title: Kitab Kashf al-tamwihat fi sharh al-Tanbīhāt (fol.1a) (Commentary on the Persian philosopher Ibn Sina’s al-Isharat wa-al-tanbihat)
Author: Abu al-Hasan Ali ibn Abi Ali ibn Muhammad al-Amidi (d. 641/1243 or 631/1233)

Year: Before 641/1243

Database size: Feature vectors of 20,722 shapes (connected components).

Importance:

AVICENNA DATABASE

This database is built from a complete manuscript, containing 300 pages, on the work of the Persian philosopher Ibn Sina.

Importance:

1) Bigger dataset for unsupervised learning.

2) Feature vectors of 123,007 shapes.
3) Verification by experts (McGill ISI).
4) Link on IJCNN 2011 (Unsupervised and Transfer Learning Challenge):

ACTIVE LEARNING CHALLENGE 2010

The Active Learning Challenge 2010, part of the Pascal2 Challenge Program, targeted pool-based active learning, in which a large unlabeled dataset is available from the onset of the challenge and participants can place queries to acquire labels in exchange for virtual cash. Participants must return prediction values for all the labels every time they want to purchase new labels. This makes it possible to draw learning curves of prediction performance versus the amount of virtual cash spent.
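The pool-based protocol described above can be sketched as a simple loop: buy one label, refit, predict on the whole pool, record a point on the learning curve. This is only an illustration under stated assumptions, not the challenge's actual code or scoring: the nearest-centroid learner, the uncertainty-based query rule, the one-unit label cost, the use of accuracy as the performance measure, and all names such as `run_active_learning` are hypothetical.

```python
import random


def fit_centroids(points, labels):
    """Mean 1-D feature value per class, from the labels purchased so far."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Label of the nearest class centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def margin(centroids, x):
    """Small margin means the two nearest centroids are almost equally close."""
    d = sorted(abs(x - c) for c in centroids.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def run_active_learning(pool, oracle, budget, cost_per_label=1, seed=0):
    """Return the learning curve as a list of (cash_spent, accuracy) pairs.

    `oracle` holds the true labels; it stands in for the organizers, who
    sell labels and score the predictions (an assumption of this sketch).
    """
    rng = random.Random(seed)
    unlabeled = list(range(len(pool)))
    bought, centroids, curve, spent = {}, {}, [], 0
    while spent + cost_per_label <= budget and unlabeled:
        if len(centroids) < 2:
            idx = rng.choice(unlabeled)  # random queries until two classes seen
        else:
            # Uncertainty sampling: query the most ambiguous unlabeled sample.
            idx = min(unlabeled, key=lambda i: margin(centroids, pool[i]))
        unlabeled.remove(idx)
        bought[idx] = oracle[idx]        # "purchase" the label
        spent += cost_per_label
        centroids = fit_centroids([pool[i] for i in bought], list(bought.values()))
        # Predictions for ALL samples are returned after every purchase.
        preds = [predict(centroids, x) for x in pool]
        accuracy = sum(p == t for p, t in zip(preds, oracle)) / len(pool)
        curve.append((spent, accuracy))
    return curve


# Synthetic two-class pool (made-up data, for illustration only).
data_rng = random.Random(42)
pool = [data_rng.gauss(0.0, 1.0) for _ in range(100)] + \
       [data_rng.gauss(4.0, 1.0) for _ in range(100)]
oracle = [0] * 100 + [1] * 100
for cash, acc in run_active_learning(pool, oracle, budget=10):
    print(cash, round(acc, 3))
```

Plotting the returned pairs gives the learning curve: prediction performance on the vertical axis against virtual cash spent on the horizontal axis.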

UNSUPERVISED AND TRANSFER LEARNING CHALLENGE 2011

The Unsupervised and Transfer Learning Challenge 2011 targeted classification problems using unsupervised and transfer learning algorithms. Such problems are found in many application domains, including pattern recognition (classification of images or videos, speech recognition), medical diagnosis, marketing (customer categorization), and text categorization (spam filtering).