Natural Scene Character Recognition Datasets

Dataset Description

There are four publicly available datasets: the Char74K dataset [1], the ICDAR 2003 robust character recognition dataset [2], the IIIT5K set [3] and the Street View Text (SVT) dataset [4]. We focus only on the recognition of English characters, which comprise 62 classes: the digits 0-9, the upper-case letters A-Z and the lower-case letters a-z. In the experiments, all images are first converted to gray scale and resized to 32 × 32 pixels.
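The class layout and preprocessing described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the class ordering (digits, upper case, lower case), the luma weights and the nearest-neighbour resampling are assumptions, and the names CLASSES, to_gray, resize_nearest and preprocess are hypothetical.

```python
import string

import numpy as np

# The 62 classes, ordered as listed in the text (the ordering is an assumption).
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(CLASSES)}


def to_gray(image):
    """Convert an H x W x 3 RGB array to gray scale; 2-D arrays pass through."""
    image = np.asarray(image, dtype=np.float64)
    if image.ndim == 2:
        return image
    # ITU-R BT.601 luma weights (an assumption; the source does not specify).
    return image[..., :3] @ np.array([0.299, 0.587, 0.114])


def resize_nearest(gray, size=32):
    """Resize a 2-D array to size x size with nearest-neighbour sampling."""
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return gray[np.ix_(rows, cols)]


def preprocess(image, size=32):
    """Gray-scale conversion followed by resizing to size x size."""
    return resize_nearest(to_gray(image), size)
```

In practice an image library with proper interpolation would normally replace resize_nearest; the sketch only fixes the order of operations and the output shape.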


Char74K_15 Dataset

The character images of the Char74K dataset [1] are mostly photographed from sign boards, hoardings and advertisements, with a few images of products in supermarkets and shops. From the Char74K dataset, we use only the English character images cropped from natural scene images. This English subset has 12,503 characters in 62 classes, of which 4,798 were labeled as bad images due to excessive occlusion, low resolution or noise. Our experiments use a small subset with a standard partition, Char74K-15, which contains 15 training samples and 15 test samples per class.
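With 62 classes, a 15/15 partition gives 62 × 15 = 930 training samples and 930 test samples. The standard Char74K-15 partition is fixed and should be used as distributed; the sketch below only illustrates the shape of such a per-class split. The helper split_per_class, its seed and its handling of small classes are hypothetical.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train=15, n_test=15, seed=0):
    """Return (train_idx, test_idx) with n_train / n_test samples per class.

    Hypothetical helper: classes with fewer than n_train + n_test samples
    are skipped rather than padded.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < n_train + n_test:
            continue
        rng.shuffle(indices)
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:n_train + n_test])
    return train_idx, test_idx
```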

ICDAR2003 Dataset

The ICDAR2003 [2] robust reading dataset was collected for the robust reading competition on scene text detection and recognition. The scene character dataset ICDAR03-CH contains more than 11,500 character images. It is a very difficult dataset because the cropped character samples contain serious non-text background outliers and many character images have very low resolution. There are 6,185 training characters and 5,430 test characters. After excluding special characters such as '!', the final experimental dataset consists of 6,113 characters for training and 5,379 characters for testing.

IIIT5K Dataset

The IIIT 5K-word (IIIT5K) dataset [3] is composed of 2,000 word images for training and 3,000 for testing. The images include both scene text images and born-digital images. The character image dataset, which is extracted from the word images, is composed of 9,678 training samples and 15,269 test samples.

SVT Dataset

The Street View Text (SVT) dataset [4] was collected from road-side scenes in Google Street View. All the images are very challenging because of large variations in illumination, character image size, and font size and style. The SVT character dataset, which was annotated in [5], is used for evaluating different scene character recognition methods. This dataset consists of 3,796 character samples from 52 categories (no digit images). The SVT character dataset is more difficult to recognize than the ICDAR2003 dataset.


Feature Extraction

We first use robust PCA [6] to recover clean character images from the blurred or corrupted original images, and then use the HOG [7] method to extract gradient features from the recovered images. These features are used for character recognition.
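The two steps can be sketched as follows, under stated assumptions. The text does not say how robust PCA is applied or solved; here the vectorized images are assumed to form the columns of a matrix D, which is decomposed as D = A + E (low-rank plus sparse) with an inexact augmented Lagrangian solver. The HOG routine is a simplified variant with hard orientation binning, not the full method of [7]. The function names and all parameter values are hypothetical.

```python
import numpy as np


def soft_threshold(X, tau):
    """Element-wise shrinkage operator."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def robust_pca(D, lam=None, tol=1e-7, max_iter=500):
    """Split D into low-rank A and sparse E (principal component pursuit).

    Inexact augmented Lagrangian iteration; the solver choice is an assumption.
    """
    D = np.asarray(D, dtype=np.float64)
    m, n = D.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    norm_two = np.linalg.norm(D, 2)
    Y = D / max(norm_two, np.abs(D).max() / lam)   # dual variable
    mu, mu_max, rho = 1.25 / norm_two, 1.25e7 / norm_two, 1.5
    A, E = np.zeros_like(D), np.zeros_like(D)
    d_norm = np.linalg.norm(D, 'fro')
    for _ in range(max_iter):
        # Sparse update: shrink the residual entry by entry.
        E = soft_threshold(D - A + Y / mu, lam / mu)
        # Low-rank update: shrink the singular values.
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        Z = D - A - E
        Y = Y + mu * Z
        mu = min(mu * rho, mu_max)
        if np.linalg.norm(Z, 'fro') / d_norm < tol:
            break
    return A, E


def hog_features(gray, cell=8, bins=9, eps=1e-6):
    """Simplified HOG: unsigned gradients, hard binning, 2x2-cell L2 blocks."""
    gray = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    n_cy, n_cx = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            hist[cy, cx] = np.bincount(bin_idx[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=bins)
    blocks = []
    for cy in range(n_cy - 1):
        for cx in range(n_cx - 1):
            block = hist[cy:cy + 2, cx:cx + 2].ravel()
            blocks.append(block / np.sqrt(np.sum(block ** 2) + eps ** 2))
    return np.concatenate(blocks)
```

Under these assumptions, a set of 32 × 32 images would be flattened into the columns of D, each column of the returned A reshaped back to 32 × 32, and hog_features applied to the result, which yields a 324-dimensional vector per image with the default settings. The feature dimension used in the paper may differ.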

Experiment Configuration

For the SVT dataset, only test samples are available for character recognition. Because the SVT character dataset has a similar distribution to ICDAR2003 and Char74K, we combined the Char74K EnglishImg dataset with the training samples of ICDAR2003 to construct a new training dataset of more than 18,600 characters.
We used the new training dataset to recognize the test samples of the ICDAR2003 and SVT datasets. Because the SVT dataset contains no digit images, we first excluded the digit images from the new training dataset. For the Char74K_15 and IIIT5K datasets, we used their respective training and test sets for evaluation. For more information, please refer to our paper.
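The construction of the combined training set can be sketched as below. The sample representation (a list of (features, label) pairs with single-character labels), the function name and the drop_digits flag are hypothetical; when the digit classes are dropped is controlled by the caller.

```python
import string

DIGITS = set(string.digits)


def build_combined_training_set(char74k_samples, icdar_train_samples,
                                drop_digits=False):
    """Merge two lists of (features, label) pairs.

    With drop_digits=True, samples labeled '0'-'9' are removed, leaving
    the 52 letter classes.
    """
    combined = list(char74k_samples) + list(icdar_train_samples)
    if drop_digits:
        combined = [(x, y) for x, y in combined if y not in DIGITS]
    return combined
```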

The image features used in our paper are available at:

Link1: (BaiduLink) Password:ynhs; Link2: (GoogleLink)

If you use the features provided here, please cite our paper below.
All the baseline recognition results on the different datasets can be found in our paper:

[1] Z. Zhang, Y. Xu, C. L. Liu, "Natural Scene Character Recognition Using Robust PCA and Sparse Representation,"
12th IAPR International Workshop on Document Analysis Systems (DAS), 2016, DOI 10.1109/DAS.2016.32.

[2] Z. Zhang, Y. Xu, J. Yang, X. Li, D. Zhang, "A Survey of Sparse Representation: Algorithms and Applications," IEEE Access, vol. 3, pp. 490-530, 2015.

Copyright: Zheng Zhang, Yong Xu, 2016.06.12
If you have any questions, please contact us.


Copyright Notice
The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.

[1] T. E. de Campos, B. R. Babu, M. Varma, "Character recognition in natural images," in Proceedings of VISAPP, Lisboa, Portugal, vol. 2, pp. 273-280, 2009.
[2] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of 7th ICDAR, pp. 682-687, 2003.
[3] A. Mishra, K. Alahari, C. V. Jawahar, "Scene text recognition using higher order language priors," in Proceedings of 23rd BMVC, Surrey, United Kingdom, 2012.
[4] K. Wang, B. Babenko, S. Belongie, "End-to-end scene text recognition," in Proceedings of 2011 ICCV, Barcelona, Spain, pp. 1457-1464, 2011.
[5] A. Mishra, K. Alahari, C. V. Jawahar, "Top-down and bottom-up cues for scene text recognition," in Proceedings of 2012 CVPR, Providence, USA, pp. 2687-2694, 2012.
[6] E. J. Candès, X. Li, Y. Ma, J. Wright, "Robust principal component analysis?" Journal of the ACM, vol. 58, no. 3, article 11, 2011.
[7] N. Dalal, B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of CVPR, San Diego, USA, vol. 1, pp. 886-893, 2005.