DECIMER V2 Benchmark Datasets
DOI: 10.5281/zenodo.8139328
Zenodo: 8139328
MaRDI QID: Q6710330
Dataset published at Zenodo repository.
Author name not available
Publication date: 12 July 2023
Copyright license: No records found.
A comprehensive benchmark of the DECIMER Image Transformer was conducted using all publicly available OCSR benchmark datasets and the DECIMER test datasets.

USPTO: A set of 5,719 images of chemical structures from the US Patent Office with the corresponding MOL files, obtained from the OSRA online presence.

UOB: A dataset of 5,740 images and MOL files of chemical structures developed by the University of Birmingham, United Kingdom, and published alongside MolRec.

CLEF: The Conference and Labs of the Evaluation Forum test set of 992 images and MOL files, published in 2012.

JPO: A subset (450 images and MOL files) of a dataset based on data from the Japanese Patent Office, obtained from the OSRA online presence. Note that this dataset contains many labels (sometimes with Japanese characters) and irregular features, such as variations in line thickness. Additionally, some images are of poor quality and contain a lot of noise.

RanDepict250k: A set of 250,000 chemical structure depictions generated with RanDepict (1.0.8), using RanDepict's depiction feature fingerprints to ensure diverse depiction parameters. None of the depicted molecules is present in the DECIMER training data. All images are 299 x 299 pixels in size.

RanDepict250k_augmented: The same 250,000 images as in the RanDepict250k dataset, with additional augmentations applied using RanDepict (examples: mild rotation, shearing, insertion of labels and reaction arrows around the structures, insertion of curved arrows in the structure). All images are 299 x 299 pixels in size.

DECIMER hand-drawn: A set of 5,088 chemical structure depictions drawn manually by a group of 24 volunteers. The drawn molecules were picked from all molecules in PubChem using the MaxMin algorithm, so that the set represents a large part of chemical space.

Indigo: 50,000 images generated by Staker et al. using Indigo, collected from the supplementary information. All images have a resolution of 224 x 224 pixels.

USPTO_big: 50,000 USPTO images from Staker et al., collected from the supplementary information. All images have a resolution of 224 x 224 pixels.

Img2Mol test set: A set of 25,000 chemical structure depictions used by Clevert et al. for testing. All images have a resolution of 224 x 224 pixels.
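The DECIMER hand-drawn set is described as having been picked with the MaxMin algorithm. The record does not say which implementation, fingerprint, or distance measure was used, so the following is only a minimal sketch of the general MaxMin idea: start from a seed item, then repeatedly add the candidate whose distance to its nearest already-picked item is largest. The function names `maxmin_pick` and `tanimoto_distance`, the seed choice, and the toy "fingerprints" (plain sets of integers) are all assumptions made for illustration, not part of the dataset.

```python
from typing import Callable, FrozenSet, List, Sequence


def tanimoto_distance(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_pick(
    items: Sequence[FrozenSet[int]],
    n_pick: int,
    distance: Callable[[FrozenSet[int], FrozenSet[int]], float],
    seed_index: int = 0,  # assumption: the first item is used as the seed
) -> List[int]:
    """Return indices of n_pick items chosen by greedy MaxMin selection."""
    if not 0 < n_pick <= len(items):
        raise ValueError("n_pick must be between 1 and len(items)")

    picked = [seed_index]
    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = [distance(items[seed_index], item) for item in items]

    while len(picked) < n_pick:
        # Choose the unpicked item that is farthest from the picked set.
        candidates = (i for i in range(len(items)) if i not in picked)
        best = max(candidates, key=lambda i: min_dist[i])
        picked.append(best)
        # Update nearest-picked distances with the newly added item.
        for i, item in enumerate(items):
            d = distance(items[best], item)
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Hypothetical toy fingerprints: sets of "on" feature indices.
toy_fps = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({1, 2, 3, 4}),
]
print(maxmin_pick(toy_fps, 3, tanimoto_distance))  # prints [0, 2, 1]
```

In this toy example the seed is item 0, the second pick is item 2 (which shares no features with the seed), and the third is item 1, the remaining item farthest from both. On a collection the size of PubChem a real selection would need an optimized implementation; this sketch only shows the selection rule.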