Genomic datasets used for evaluation of k-mer representations and indexes

DOI: 10.5281/zenodo.14722244, Zenodo: 14722244, MaRDI QID: Q6701350

Dataset published at Zenodo repository.

Author name not available

Publication date: 23 January 2025

Copyright license: No records found.

This record contains genomic datasets, including subsampled k-mer sets for some of them (files with names containing `_subsampled_`). It provides the following datasets:

- Two E. coli pan-genomes, obtained as the union of the E. coli genomes from the 661k collection. One contains all genomes (without quality filtering); for the other (HQ), high-quality filtering was applied.
- S. pneumoniae pan-genome: 616 genomes, as provided in RASE DB S. pneumoniae, https://github.com/c2-d2/rase-db-spneumoniae-sparc/
- SARS-CoV-2 pan-genome, downloaded from GISAID, https://gisaid.org/ (access upon registration), on Jan 25, 2023 (GISAID version 2023/01/23, 14,682,066 genomes, 430 Gbp).
- Metagenomic sample SRS063932 (Illumina raw reads) of the human microbiome with accession SRX023459, downloaded from https://www.hmpdacc.org/hmp/HMASM/. The FASTQ files were converted to FASTA files using `seqtk seq -A -C`.
- Human RNA-seq Illumina raw reads with accession SRX348811, downloaded using the prefetch tool from the SRA toolkit and then converted into the FASTA format by `fastq-dump --split-3 --fasta`.
- Human genome Illumina raw reads with accession SRX016231, downloaded using the prefetch tool from the SRA toolkit and then converted into the FASTA format by `fastq-dump --split-3 --fasta`.
- Human genome assembly chm13v2.0 (T2T), downloaded from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz.
- Two MiniKraken datasets (4GB and 8GB), downloaded from https://ccb.jhu.edu/software/kraken/, with the 31-mers dumped using Jellyfish 1.1.12.

The resulting FASTA files (apart from the human genome assembly chm13v2.0 and the MiniKraken datasets) were converted to unitigs with GGCAT v1.1.0 by `ggcat build -k {kmer-size} -m 200 -j 5 -s {min-freq} -o {preprocessed_unitigs} {input_FASTA}`, where we used $k=128$ and `{min-freq}`=1 for pan-genomes and $k=32$ and `{min-freq}`=2 for datasets from raw reads.
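As an illustration of the two GGCAT parameter settings described above, the following minimal Python sketch assembles the `ggcat build` command line for either kind of dataset. It only builds and prints the command; it does not run GGCAT. The function name `ggcat_command`, the preset labels `"pan-genome"` and `"raw-reads"`, and the file names are hypothetical and are not part of the record.

```python
import shlex

# Parameter presets as stated in the record description:
# k=128, min-freq=1 for pan-genomes; k=32, min-freq=2 for raw reads.
# The preset labels themselves are made up for this sketch.
PRESETS = {
    "pan-genome": {"k": 128, "min_freq": 1},
    "raw-reads": {"k": 32, "min_freq": 2},
}


def ggcat_command(kind, input_fasta, output_unitigs):
    """Return the `ggcat build` argument list for the given dataset kind."""
    preset = PRESETS[kind]
    return [
        "ggcat", "build",
        "-k", str(preset["k"]),
        "-m", "200",
        "-j", "5",
        "-s", str(preset["min_freq"]),
        "-o", output_unitigs,
        input_fasta,
    ]


if __name__ == "__main__":
    # Placeholder file names, chosen only for illustration.
    cmd = ggcat_command("raw-reads", "reads.fa", "reads_unitigs.fa")
    print(shlex.join(cmd))
```

Returning an argument list rather than a single string means the same value could be passed to `subprocess.run` without shell quoting issues, should one want to execute it.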
Finally, the subsampled files `{dataset}_subsampled_k{$k$}_r0.1.fa.xz` contain a random 10% of the distinct canonical $k$-mers from the whole $k$-mer set of the given dataset. Each FASTA file contains one subsampled $k$-mer per sequence.
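The record does not state how the subsampling was implemented. The Python sketch below shows one way such a file could be produced, under three assumptions: the canonical form of a $k$-mer is the lexicographically smaller of the $k$-mer and its reverse complement (a common convention), windows containing symbols other than A, C, G, T are skipped, and the sample is drawn uniformly without replacement. All function names and the numeric FASTA headers are hypothetical.

```python
import random

# Translation table for complementing A/C/G/T.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def distinct_canonical_kmers(sequences, k):
    """Collect the set of distinct canonical k-mers occurring in the sequences."""
    kmers = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip windows containing N or other non-ACGT symbols.
            if set(kmer) <= set("ACGT"):
                kmers.add(canonical(kmer))
    return kmers


def subsample(kmers, rate, seed=0):
    """Choose round(rate * n) k-mers uniformly at random, without replacement."""
    ordered = sorted(kmers)  # fixed order so the result depends only on the seed
    rng = random.Random(seed)
    return rng.sample(ordered, round(rate * len(ordered)))


def to_fasta(kmers):
    """Format k-mers as FASTA text with one k-mer per record."""
    return "".join(f">{i}\n{kmer}\n" for i, kmer in enumerate(kmers))


if __name__ == "__main__":
    # Toy input; real datasets would be read from the FASTA files instead.
    rng = random.Random(1)
    toy = ["".join(rng.choice("ACGT") for _ in range(500))]
    sampled = subsample(distinct_canonical_kmers(toy, k=5), rate=0.1)
    print(to_fasta(sampled), end="")
```

Holding the whole $k$-mer set in memory is only practical for small inputs; for datasets of the sizes listed in this record, a streaming or hash-based selection would be needed instead.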
