Noisy OCR Dataset (NOD) - MaRDI portal





DOI: 10.5281/zenodo.5068735
Zenodo: 5068735
MaRDI QID: Q6717928

Dataset published in the Zenodo repository.

Author name not available.

Publication date: 6 July 2021

Copyright license: No records found.



This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, Old Books (English) and Yarmouk (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

Source images

The seed of the English collection was the Old Books Dataset (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the Yarmouk Arabic OCR Dataset (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

Artificial noise application

The dataset was created as follows:

- First, a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise.
- Then six ideal types of image noise --- blur, weak ink, salt and pepper, watermark, scribbles, and ink stains --- were applied both to the colour version and the binary version of the images, thus creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
- Lastly, all available combinations of *two* noise filters were applied to the colour and binary images, for an additional 30 versions.

This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents.

The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. See this link for how to unzip .tar.lzma files.

References:

Barcha, Pedro. 2017. Old Books Dataset. GitHub Repository. GitHub. https://github.com/PedroBarcha/old-books-dataset.

Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. Yarmouk Arabic OCR Dataset. In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.

Hegghammer, Thomas. 2021. OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment. SocArXiv. https://osf.io/preprints/socarxiv/6zfvs.
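The version counts in the description follow from simple combinatorics: six noise types applied singly to the two base versions (colour and greyscale/binary) give 12 one-layer versions, and the 15 unordered pairs of noise types applied to both base versions give 30 two-layer versions. A minimal sketch checking the arithmetic (variable names are illustrative, not taken from the dataset):

```python
from itertools import combinations

# The six ideal noise types named in the dataset description.
noise_types = ["blur", "weak_ink", "salt_and_pepper",
               "watermark", "scribbles", "ink_stains"]
base_versions = 2  # colour and greyscale, no added noise

# One noise filter applied to each base version.
one_layer = base_versions * len(noise_types)
# Every unordered pair of two distinct filters, on each base version.
two_layers = base_versions * len(list(combinations(noise_types, 2)))
total_versions = base_versions + one_layer + two_layers

print(total_versions)        # 44
print(322 * total_versions)  # 14168 English documents
print(100 * total_versions)  # 4400 Arabic documents
```

This reproduces the stated totals: 2 + 12 + 30 = 44 versions, 14,168 English and 4,400 Arabic documents.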
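The archive is distributed as .tar.lzma. One way to unpack it, as a sketch, is Python's standard library: tarfile's "r:xz" mode reads through the lzma module, whose auto-detection accepts both the .xz and the legacy .lzma container formats. The file and directory names below are illustrative, not part of the dataset:

```python
import tarfile

def extract_tar_lzma(archive_path: str, dest_dir: str) -> None:
    """Extract a .tar.lzma archive into dest_dir.

    tarfile's "r:xz" mode delegates to lzma.LZMAFile, which by default
    auto-detects both the modern .xz and the legacy .lzma formats.
    """
    with tarfile.open(archive_path, mode="r:xz") as tar:
        tar.extractall(dest_dir)

# Hypothetical usage (file name is illustrative):
# extract_tar_lzma("nod.tar.lzma", "nod/")
```

Note that the uncompressed dataset is ~193 GiB, so check available disk space before extracting.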






This page was built for dataset: Noisy OCR Dataset (NOD)