Easy ORCID

From MaRDI portal



DOI: 10.5281/zenodo.13333068
Zenodo: 13333068
MaRDI QID: Q6723876

Dataset published at Zenodo repository.

Author name not available

Publication date: 17 August 2024



The first-party ORCID data dump uses a data structure that is overly complex for most use cases. This Zenodo record contains a derived version that is much more straightforward, more accessible, and smaller. So far, it includes employers, education, external identifiers, and publications linked to PubMed. It adds processing to ground employers and educational institutions using the Research Organization Registry (ROR). It also does some minor string processing, such as standardization of education types (e.g., Bachelor of Science, Master of Science) and standardization of PubMed references.

Records

The records.jsonl.gz file is a JSON Lines file where each row represents a single ORCID record in a simple, well-defined schema (see schema.json).

The records_hq.jsonl.gz file is a subset of the full records file that contains only records with at least one ROR-grounded employer, at least one ROR-grounded education, or at least one publication indexed in PubMed. The point of this subset is to remove ORCID records that generally cannot be matched to any external information.

This record also contains a SQLite database, orcid.db, with tables for researchers and for organizations. This is useful for quick lookup of data based on an ORCID local unique identifier.

Employers, educational institutions, and memberships that couldn't be grounded to a ROR record are listed in affiliation_missing_ror.tsv.

Nomenclature Authority Cross-References

Websites, social links, and other identifiers are parsed and standardized to comply with the Bioregistry, then shared using the Simple Standard for Sharing Ontological Mappings (SSSOM) in the sssom.tsv.gz file. This makes it possible to get Scopus, Web of Science, GitHub, Google Scholar, and other profiles for records that include them. This information is also available through the main records file.
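As a rough illustration of working with the cross-references file, the sketch below reads a gzipped SSSOM TSV using only the Python standard library. It assumes the file uses the standard SSSOM column names subject_id and object_id and may start with a commented metadata header; the function names and the "github" prefix in the usage note are hypothetical and are not taken from this record.

```python
import csv
import gzip
from collections import defaultdict


def load_sssom_mappings(path):
    """Map each subject CURIE to the list of object CURIEs it is mapped to.

    Assumes standard SSSOM columns ``subject_id`` and ``object_id``.
    """
    mappings = defaultdict(list)
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as file:
        # SSSOM TSV files may begin with a metadata block whose lines start with "#"
        rows = (line for line in file if not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            mappings[row["subject_id"]].append(row["object_id"])
    return dict(mappings)


def profiles_with_prefix(mappings, prefix):
    """Keep only the mapped identifiers whose CURIE starts with ``prefix:``."""
    filtered = {}
    for subject, objects in mappings.items():
        matching = [obj for obj in objects if obj.startswith(prefix + ":")]
        if matching:
            filtered[subject] = matching
    return filtered
```

For example, after downloading the file, calling profiles_with_prefix(load_sssom_mappings("sssom.tsv.gz"), "github") would keep only the GitHub cross-references, provided that is the prefix actually used in the file.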
Authorship Links

Authorships are extracted and standardized in the pubmeds.tsv.gz file, which contains an ORCID column and a PubMed column that has been pre-sanitized to contain only local unique identifiers. This information is also available through the main records file.

Lexical Indexes

This record includes two pre-built Gilda indexes for named entity recognition (NER) and named entity normalization (NEN). One contains all records, and the second is filtered to high-quality records. The following Python code snippet can be used for grounding:

    from gilda import Grounder

    # URL of the pre-built Gilda index file (gilda_hq.tsv.gz)
    url = "https://zenodo.org/records/11474470/files/gilda_hq.tsv.gz?download=1"
    grounder = Grounder(url)
    # Ground a name string against the index
    results = grounder.ground("Charles Tapley Hoyt")

Ontology Artifacts

The file orcid.ttl.gz is an OWL-ready RDF file that can be opened in Protégé or used with the Ontology Development Kit. It can also be converted into OWL XML, OWL Functional Notation, or other OWL formats using ROBOT.

This artifact can serve as a replacement for the ones generated by https://github.com/cthoyt/orcidio, which was a smaller-scale way of turning ORCID records for contributors to OBO Foundry ontologies into a small OWL file. The export here contains all ORCID records with names.

Reproduction

This record is automatically generated with code in https://github.com/cthoyt/orcid_downloader.
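To illustrate how the authorship links file described under Authorship Links might be used, the sketch below loads a gzipped two-column TSV into a lookup table with the Python standard library. The column order (ORCID first, PubMed second) and the presence of a header row are assumptions, since the description does not state them, and the function names are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def load_authorships(path, has_header=True):
    """Return a dict from ORCID identifier to the set of its PubMed identifiers.

    Assumes column 1 is the ORCID and column 2 is the PubMed identifier.
    """
    authorships = defaultdict(set)
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as file:
        reader = csv.reader(file, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the assumed header row
        for row in reader:
            if len(row) < 2:
                continue  # ignore blank or malformed lines
            authorships[row[0]].add(row[1])
    return dict(authorships)


def authors_by_pubmed(authorships):
    """Invert the lookup: PubMed identifier to the set of ORCID identifiers."""
    inverted = defaultdict(set)
    for orcid, pubmeds in authorships.items():
        for pubmed in pubmeds:
            inverted[pubmed].add(orcid)
    return dict(inverted)
```

Inverting the table this way gives a simple route to finding ORCID records that share a publication. If the real file has no header row, pass has_header=False.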





