PUCIT-OHUL: PUCIT Offline Handwritten Urdu Lines Dataset
The PUCIT-OHUL dataset is a collection of offline handwritten Urdu text covering the writing styles of 100 writers. It was collected by the Computer Vision and Machine Learning Group, Punjab University College of Information Technology (PUCIT), Lahore, Pakistan, under the supervision of Tayaba Anjum and Nazar Khan. The dataset can be used in research areas related to text and handwriting recognition. Version 1.0 of the dataset is freely available to researchers.
The dataset was collected as part of a project on attention-based recognition of offline handwritten Urdu text.
Overview
- 7,309 offline handwritten Urdu text lines extracted from handwritten text pages.
- 78,870 words written by 100 different writers.
- 98 unique Urdu characters.
- Pages were scanned at a resolution of 200 DPI.
- 100 undergraduate students aged 20 to 24 were asked to submit any handwritten Urdu text along with a corresponding ground-truth text file.
- No restriction on pen type, page type, and ink colour.
- No restriction on what to write.
- Pages scanned at 200 DPI and text lines manually segmented.
- Text line images have not been deskewed.
- Submitted ground-truth was thoroughly checked and corrected or completed by a team of three people.
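The counts listed above (lines, words, unique characters) are the kind of statistics that can be recomputed from the ground-truth transcriptions. The following is a minimal sketch in Python, assuming the transcriptions have already been loaded into a list of strings; the function name `transcription_stats` and the sample strings are hypothetical and are not part of the dataset. How the authors counted whitespace and punctuation is not stated, so this sketch simply ignores whitespace.

```python
from collections import Counter


def transcription_stats(lines):
    """Summarise a list of ground-truth transcription strings.

    Returns the number of lines, the number of whitespace-separated
    words, the sorted unique non-whitespace characters, and a
    per-character frequency count.
    """
    word_count = 0
    char_counts = Counter()
    for text in lines:
        # Words are taken to be whitespace-separated tokens (an assumption).
        word_count += len(text.split())
        # Count every character except whitespace.
        char_counts.update(ch for ch in text if not ch.isspace())
    return {
        "lines": len(lines),
        "words": word_count,
        "characters": sorted(char_counts),
        "char_counts": char_counts,
    }


# Two made-up transcriptions for illustration only (not taken from the dataset).
sample = ["یہ ایک مثال ہے", "اردو متن"]
stats = transcription_stats(sample)
print(stats["lines"], stats["words"], len(stats["characters"]))
```

For reference, the figures listed above work out to about 10.8 words per line (78,870 / 7,309) and about 73 lines per writer (7,309 / 100) on average.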
Download
The dataset is publicly available and can be downloaded here.
UPDATE (October 2021):
Ground-truth annotations have been updated. Please use train_labels_v2.xlsx and test_labels_v2.xlsx.
Terms of Use
This dataset can be used for non-commercial research purposes only. If you publish material based on this dataset, we request that you cite the following papers.
1) Tayaba Anjum and Nazar Khan, An attention based method for offline handwritten Urdu text recognition, 17th International Conference on Frontiers in Handwriting Recognition (ICFHR 2020), Sep 7-10, 2020.
2) Tayaba Anjum and Nazar Khan, CALText: Contextual Attention Localization for Offline Handwritten Text, Neural Processing Letters, 2023.
Bibtex
@inproceedings{anjum2020urdu_ohtr,
  author = {Anjum, Tayaba and Khan, Nazar},
  title = {{An attention based method for offline handwritten Urdu text recognition}},
  booktitle = {International Conference on Frontiers in Handwriting Recognition (ICFHR)},
  month = {September},
  year = {2020}
}
@article{anjum2023caltext,
  title={CALText: Contextual Attention Localization for Offline Handwritten Text},
  author={Anjum, Tayaba and Khan, Nazar},
  journal={Neural Processing Letters},
  volume={},
  number={},
  pages={},
  month={April},
  year={2023},
  publisher={Springer, NY},
  url={https://doi.org/10.1007/s11063-023-11258-5}
}
Contact
For any queries, please contact us at the following email addresses:
- nazarkhan@pucit.edu.pk
- phdcsf14m005@pucit.edu.pk
Acknowledgments
This work was supported by the Higher Education Commission (Pakistan) under Grant 8329/Punjab/NRPU/R&D/HEC/2017. We thank the undergraduate students at the University of Management and Technology (UMT) for agreeing to become scribes for our dataset. We also thank Bilal Rasheed and Faizan Saleem for carrying out the tedious process of ground-truth annotation and correction. The updated annotations were made possible with the help of Adeela Islam, Asmat Batool, Rabia Sirhindi, Tauseef Iftikhar and Abubakar Siddique.