Data and code for 'Pseudogenes act as a neutral reference for detecting selection in prokaryotic pangenomes'

DOI: 10.5281/zenodo.8326664; Zenodo: 8326664; MaRDI QID: Q6683175

Dataset published in the Zenodo repository.

Author name not available

Publication date: 7 September 2023

Copyright license: No records found.

This repository contains the code and files for reproducing the analyses and results reported in 'Pseudogenes act as a neutral reference for detecting selection in prokaryotic pangenomes' by Gavin M. Douglas and B. Jesse Shapiro (https://doi.org/10.1038/s41559-023-02268-6).

File organization and descriptions:

code/ - Contains GitHub repository releases of code used in the manuscript (the other folders contain data files only). This code is provided here as well as on GitHub to ensure long-term access.
  handy_pop_gen-1.1.0/ - Release v1.1.0 of the convenience repository (used for specific data processing and analysis steps referred to in the manuscript).
  pangenome_pseudogene_null-1.1.0/ - Main code repository for the manuscript.

broad_pangenome_analysis/
  element_info/element_counts.tsv.gz - Counts of (filtered) pseudogenes and intact genes called per genome accession.
  element_info/gene_sizes.tsv.gz - Gene sizes in base-pairs.
  element_info/pseudogene_sizes.tsv.gz - Filtered pseudogene sizes in base-pairs.
  element_info/element_percent_coverage/*tsv.gz - Tables containing the percent genome coverage of genes and pseudogenes, by accession and averaged over accessions per species separately.
  example_Mycoplasmopsis_bovis_panaroo_output.csv.gz - Panaroo output table for Mycoplasmopsis bovis, which was used for an example. Corresponds to the gene_presence_absence.csv file in the raw Panaroo output.
  focal_and_non.focal_full_to_short.tsv.gz - Map file of full to short (and unique) species ids used in the analysis. Primarily to include species ids in cluster names without making them unnecessarily long.
  genome_info/accessions.tsv.gz - Genome accessions used for the broad pangenome analysis (note that not all genome accessions could be downloaded [and were ignored], which is indicated in the "could_download" column).
  genome_info/genome_sizes.tsv.gz - Sizes of all genomes used for the broad pangenome analysis.
  metrics_additional_subsamples.tsv.gz - Contains columns also found in the pangenome_and_related_metrics.tsv.gz file below, but based on genome subsamplings of 3 and 20, rather than 9.
  model_output/pangenome_linear_models.rds - R Data Serialization file containing the output of R linear model objects (generated by lm and provided as an R list object). There are separate elements in the list for the mean number of genes, genomic fluidity, percentage singletons (si), and si/sp.
  model_output/linear_model_coef.tsv.gz - Coefficient summary table for all linear models.
  pangenome_and_related_metrics.tsv.gz - Metrics used for the broad pangenome analysis across 670 prokaryotic species. Note that this table was filtered down to 668 species after excluding those with fewer than 9 genomes.
  pangenome_and_related_metrics_filt.tsv.gz - Filtered table, as described above.
  taxonomy.tsv.gz - Taxonomy for all species used for this analysis, taken from GTDB. Row names are species names.

indepth_10_species_analysis/
  cluster_breakdown_tables/ - Folder containing tables providing a breakdown of how clusters are distributed by element type, pangenome partition, and species. Provided for easy plotting.
  cluster_COG_annot.tsv.gz - Mapping of cluster IDs to COG annotations.
  cluster_filt_lengths_and_additional.tsv.gz - Metadata on clusters, most pertinently the length of the representative sequence in the cluster (which was used to filter out clusters shorter than the cut-off below which pseudogenes could not be called).
  cluster_member_breakdown.tsv.gz - Table providing information on each element (called pseudogenes and intact genes), such as which cluster it is part of and which species and genome accession it is found in.
  cluster_types.rds - R Data Serialization file containing an R list providing a breakdown of all clusters into categories (intact/pseudogene/mixed, where mixed means containing both pseudogene and intact elements).
  COG_enrichment_results/ultra.cloud-COG-gene-enrichments.tsv.gz - Output file with enrichment test summaries for COG IDs in significant COG categories, which was run for the ultra-cloud pangenome partition model only.
  element_glmm_input.tsv.gz - Table containing all information used for fitting generalized linear mixed models.
  focal_species.txt - Names of species used for the in-depth analysis.
  genome_info/ - Folder containing the genome accessions (and the corresponding genome sizes) for all ten analyzed species.
  glmm_output/ - Folder containing R Data Serialization files containing output R objects after fitting generalized linear mixed models (only ultra-rare files are present, due to file size constraints).
  per_genome_element.type_percent_coverages.rds - R Data Serialization file containing an R list providing the percent coverage by intact genes vs pseudogenes per accession (nested by species).
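
Most of the tables listed above are gzip-compressed, tab-separated files (*.tsv.gz). As a minimal sketch of how such a file could be read without decompressing it first, the snippet below uses only the Python standard library. The helper name read_tsv_gz, the demo file name, and the column names "species" and "num_genomes" are hypothetical and chosen for illustration only; they are not taken from the dataset. The sketch also assumes the header has one field per column; tables written from R with row names (such as taxonomy.tsv.gz is described as having) may need extra handling.

```python
import csv
import gzip
import os
import tempfile


def read_tsv_gz(path):
    """Read a gzip-compressed, tab-separated file into a list of dicts keyed by the header row."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Self-contained demonstration on a tiny synthetic file.
# The columns below are hypothetical, not the real columns of any dataset table.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_table.tsv.gz")
with gzip.open(demo_path, mode="wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["species", "num_genomes"])
    writer.writerow(["Species_A", "12"])
    writer.writerow(["Species_B", "9"])

demo_rows = read_tsv_gz(demo_path)
for row in demo_rows:
    print(row["species"], row["num_genomes"])
```

All values are returned as strings, so numeric columns need explicit conversion. To read one of the real tables, pass its path (for example, the location of element_info/gene_sizes.tsv.gz after downloading the archive) to read_tsv_gz. The .rds files are R-specific and are most directly loaded in R with readRDS.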
