Track 2

Large Vocabulary Arabic Handwritten Online Character Recognition Competition

Motivation

The need for reliable, robust and flexible Arabic online handwritten character recognition systems is growing rapidly with the new generations of mobile and handheld devices. Promising applications of this technology lie in the educational and business domains, as well as in data gathering and other areas.

No standard large vocabulary dataset is available for training and evaluating Arabic online handwriting systems. The Arabic Language TEchnology Center (ALTEC) is producing a large dataset (1,000 writers and 5,000 pages) of Arabic online handwritten documents, intended for training and benchmarking large vocabulary systems.

We invite all companies and research groups to participate in this competition, which offers four main benefits:

1-   A standard test dataset against which all systems can be compared, as well as a large training corpus that can be used by industry and academia.

2-   A way to measure the state of the art in Arabic online handwriting recognition technology.

3-   The opportunity to compare different approaches and ideas, which would ultimately benefit the Arabic online handwriting recognition industry.

4-   Access for research groups and developing companies to a well-designed, large training dataset, which would also benefit the industry.

Data Sets

Training Database

The ALTEC Arabic Handwriting Corpus will be made available to all groups participating in the competition.

This database consists of 5,000 pages that include approximately 35,000 lines. Each line includes one sentence. The whole database includes around 175,000 words, which consist of approximately 500,000 PAWs (pieces of Arabic words) and about 1 million characters. The database is collected from 1,000 different writers, with an average of 5 pages per writer. It includes samples of 17,000 unique PAWs, which are the most frequent PAWs in the Arabic language and provide around 95% coverage of the language. The database also includes samples of the common punctuation marks and all the numerals. The database will be available in InkXML standard format. A document describing the annotation tags and conventions used will be distributed with the database. The database will be available to all participants on July 15th 2011.
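The corpus figures quoted above imply some rough per-unit averages. The short sketch below only does that arithmetic; the numbers are the approximate values stated in the text, and the dictionary keys and the `ratio` helper are names chosen here for illustration, not part of any ALTEC tooling.

```python
# Approximate corpus figures as quoted in the description above.
corpus = {
    "writers": 1_000,
    "pages": 5_000,
    "lines": 35_000,
    "words": 175_000,
    "paws": 500_000,
    "characters": 1_000_000,
}


def ratio(numerator: str, denominator: str) -> float:
    """Average number of `numerator` units per `denominator` unit."""
    return corpus[numerator] / corpus[denominator]


# Derived averages; these follow only from the rounded totals above.
averages = {
    "pages per writer": ratio("pages", "writers"),
    "lines per page": ratio("lines", "pages"),
    "words per line": ratio("words", "lines"),
    "paws per word": ratio("paws", "words"),
    "characters per paw": ratio("characters", "paws"),
}

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Because the totals are approximate, these averages are indicative only: about 7 lines per page, 5 words per line, just under 3 PAWs per word, and 2 characters per PAW.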

Test Dataset

The test dataset will be available to all participants on August 15th 2011. The participants should run their recognition engines on the test data, and produce corresponding text outputs for each line in the given data.

Each participating entity can have up to three different engines.

No adaptation on test data is allowed. No training on test data is allowed.

A dictionary containing the list of the words in the corpus may be used. The results should be delivered to the competition committee within 72 hours of the test data becoming available, that is, by August 18th 2011. A draft paper is expected from each participant describing the engines used and the results obtained.
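The 72-hour submission window can be checked with simple date arithmetic. The sketch below assumes the release time given in the schedule (August 15th 2011 at 12pm Egypt time) and keeps all times as naive local Egypt times; the `is_on_time` helper is hypothetical, and treating a submission at the exact deadline as on time is an assumption.

```python
from datetime import datetime, timedelta

# Test data release: August 15th 2011, 12pm, as naive local (Egypt) time.
test_data_release = datetime(2011, 8, 15, 12, 0)

# Participants have 72 hours to return their recognition results.
submission_window = timedelta(hours=72)
results_deadline = test_data_release + submission_window


def is_on_time(submitted_at: datetime) -> bool:
    """Return True if a submission (local Egypt time) meets the deadline.

    Assumption: a submission at exactly the deadline counts as on time.
    """
    return submitted_at <= results_deadline


print(results_deadline.strftime("%B %d, %Y %H:%M"))
```

The computed deadline falls on August 18th 2011 at 12:00, which agrees with the date given for delivering the results.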

Evaluation Process

  • Each participating group may submit up to three results using three different engines.
  • Participating groups will be anonymous. A code number will be given to each candidate engine, and the results will be announced by code number only.
  • For the sake of fairness and authenticity, each participant will be asked to bring their engines to ALTEC on a specific date (August 20th 2011) so that random tests can be run on sample data and compared with their submitted results.
  • If a group cannot come to the ALTEC premises (because it is based abroad), it is asked to send us its engine so that the authenticity test can be performed.
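The authenticity test described above could be organised along the following lines. This is only an illustrative sketch, not the committee's actual procedure: the function `spot_check`, the line identifiers, the exact-match criterion and the way the engine is called are all assumptions made here.

```python
import random
from typing import Callable, Dict


def spot_check(
    submitted: Dict[str, str],
    rerun_engine: Callable[[str], str],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Re-run an engine on a random sample of test lines.

    `submitted` maps a (hypothetical) line identifier to the text the
    participant submitted for it; `rerun_engine` produces the text for a
    line identifier when the engine is run again on site. Returns the
    fraction of sampled lines whose re-run output matches the submission.
    """
    if not submitted:
        raise ValueError("no submitted results to check")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    line_ids = sorted(submitted)
    sample = rng.sample(line_ids, min(sample_size, len(line_ids)))
    matches = sum(
        1 for line_id in sample if rerun_engine(line_id) == submitted[line_id]
    )
    return matches / len(sample)


# Usage with made-up data: an engine that reproduces its submission exactly.
submitted_results = {"line-001": "نص", "line-002": "كتاب", "line-003": "قلم"}
agreement = spot_check(submitted_results, submitted_results.get, sample_size=2)
print(f"agreement on sampled lines: {agreement:.0%}")
```

An agreement rate well below 100% would suggest that the submitted results were not produced by the engine under test, although the threshold to apply is a decision for the committee.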

Important dates:

1-   All participants should register for the competition by June 15th 2011.

2-   The training data will be available to registered participants on July 15th 2011, after signing copyright and confidentiality forms.

3-   The test data will be available on August 15th 2011 at 12pm (Egypt time).

4-   The recognition results should be sent to the committee before August 18th 2011 at 12pm (Egypt time).

5-   The competition results and winners will be announced by September 10th 2011.

6-   A draft paper detailing the techniques used is expected on September 18th 2011.

7-   Participants are encouraged to attend the workshop that will be held during the ALTIC conference (October 9-10 2011, www.altec-center.org) to discuss the competition results.

8-   More details about the formats of the training and test data, and about the submission format, will be announced on the ALTEC web site (www.altec-center.org) on May 15th 2011.

Contact Information:

Dr. Sherif Abdou

Sherif[dot]Abdou[at]altec-center[dot]org

Dr. Mohamed Waleed Fakhr

Waleed[dot]Fakhr[at]altec-center[dot]org