Synthetic datasets of the UK Biobank cohort

DOI: 10.5281/zenodo.13983170
Zenodo: 13983170
MaRDI QID: Q6722639

Dataset published at Zenodo repository.

Author name not available

Publication date: 23 October 2024

Copyright license: No records found.

This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort. The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

Note: although the synthetic datasets resemble the real ones in several respects, users should be aware that these data are fake and must not be used for testing or making inferences on specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

The original datasets are described in the article by Vanoli et al. in Epidemiology (2024) (DOI: 10.1097/EDE.0000000000001796), which is freely available and also provides information about the data sources. The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

Content

The series of synthetic datasets (each stored in two formats, csv and RDS) is the following:

synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
synthbdbasevar: baseline variables, mostly collected at recruitment.
synthpmdata: annual average exposure to PM2.5 for each participant, reconstructed using their residential history.
synthoutdeath: records of deaths that occurred during the follow-up, with date and ICD-10 code.

In addition, this repository provides the following files:

codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including codes, names, and locations (as northing/easting coordinates of the British National Grid).
Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes.

Generation of the synthetic data

The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).

The first part merges all the data, including the annual PM2.5 levels, in a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the single datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure, represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.

This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are intended only for illustrative purposes, and they must not be used to test other research hypotheses.
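The second generation step described above, in which yearly death events are drawn from model-based risks, can be illustrated with a minimal sketch. It is written in Python with only the standard library, whereas the actual scripts in the GitHub repo are in R. The function name simulate_death_year, the baseline hazard, the PM2.5 coefficient, the reference level, and the toy exposure values are all hypothetical choices made for illustration, not estimates from the article. The real procedure relies on a Cox model with several predictors, which this sketch reduces to a single exposure term.

```python
import math
import random


def simulate_death_year(pm25_by_year, baseline_hazard, beta, rng, ref=10.0):
    """Return the first year with a simulated death, or None if the subject survives.

    Assumption (not from the article): the yearly hazard has a proportional
    hazards form, baseline_hazard * exp(beta * (pm25 - ref)), with a constant
    baseline and PM2.5 as the only predictor.
    """
    for year in sorted(pm25_by_year):
        hazard = baseline_hazard * math.exp(beta * (pm25_by_year[year] - ref))
        # Convert the yearly hazard into a probability of dying within the year.
        p_death = 1.0 - math.exp(-hazard)
        if rng.random() < p_death:
            return year
    return None


# Toy cohort: 1,000 fake participants with made-up annual PM2.5 levels.
rng = random.Random(2024)
cohort = {
    pid: {year: rng.uniform(6.0, 16.0) for year in range(2010, 2020)}
    for pid in range(1, 1001)
}

# Hypothetical parameters, chosen only to produce a plausible number of events.
deaths = {}
for pid, exposure in cohort.items():
    year = simulate_death_year(exposure, baseline_hazard=0.01, beta=0.05, rng=rng)
    if year is not None:
        deaths[pid] = year

print(f"{len(deaths)} simulated deaths among {len(cohort)} participants")
```

Converting the hazard to a probability with 1 - exp(-hazard) keeps the yearly risk between 0 and 1 whatever the exposure level, and stopping at the first event ensures that each participant has at most one death record.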
