Main dataset for the Large-scale analysis of the β-lactamase sequence space with protein language models (Q6706224)
From MaRDI portal
Dataset published at Zenodo repository.
| Language | Label | Description | Also known as |
|---|---|---|---|
| English | Main dataset for the Large-scale analysis of the β-lactamase sequence space with protein language models | Dataset published at Zenodo repository. | |
Statements
The main dataset for the publication "Large-scale analysis of the β-lactamase sequence space with protein language models". This dataset contains 29,445 rows and 82 columns and is provided in Parquet format. The rows represent all sequences retrieved from the BLDB. The columns contain information processed from the BLDB, including the taxonomy of each sequence annotated against the Genome Taxonomy Database (GTDB RS207), the per-protein embeddings derived from five protein language models (ESM-1b, ESM2-650, ESM2-3b, CARP-640M, ProtTrans-t5-xl-u50), functional annotations estimated with Biopython, the sequence quality filters applied to select sequences for the analysis, annotations from the AlphaFold Database (AFDB) for the available structures, and secondary structure annotations generated with pyDSSP from the structures predicted by AlphaFold2. The 2-dimensional PCA, t-SNE, and UMAP representations for the evaluated protein language models are provided as separate datasets in CSV and Parquet formats. The algorithm used and the specific set of β-lactamases are indicated at the beginning of each filename: sbl for serine β-lactamases and mbl for metallo-β-lactamases. For more information, see the GitHub repository: https://github.com/miangoar/Betalactamase-analysis-with-machine-learning
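The filename convention and file formats described above can be handled with a short helper. The following is a minimal sketch, not code from the dataset authors: the function names `describe_projection_file` and `load_table`, the lowercase algorithm tokens (`pca`, `tsne`, `umap`), the token separators, and the example filenames are all assumptions made for illustration. Only the `sbl`/`mbl` prefixes, the three algorithms, and the CSV/Parquet formats come from the description.

```python
import re
from pathlib import Path

# Assumed spellings of the algorithm tokens in filenames (hypothetical).
ALGORITHMS = {"pca": "PCA", "tsne": "t-SNE", "umap": "UMAP"}
# Prefixes stated in the dataset description.
FAMILIES = {"sbl": "serine beta-lactamases", "mbl": "metallo-beta-lactamases"}


def describe_projection_file(filename):
    """Return (algorithm, family) guessed from the tokens of a filename.

    Either element is None when no matching token is found.
    """
    stem = Path(filename).stem.lower().replace("t-sne", "tsne")
    # Assumption: tokens are separated by underscores, hyphens, dots, or spaces.
    tokens = [t for t in re.split(r"[_\-.\s]+", stem) if t]
    algorithm = next((ALGORITHMS[t] for t in tokens if t in ALGORITHMS), None)
    family = next((FAMILIES[t] for t in tokens if t in FAMILIES), None)
    return algorithm, family


def load_table(path):
    """Load a CSV or Parquet file into a pandas DataFrame.

    Requires pandas, plus a Parquet engine such as pyarrow for .parquet files.
    """
    import pandas as pd  # imported lazily so the parser above has no dependencies

    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".parquet":
        return pd.read_parquet(path)
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")


if __name__ == "__main__":
    # Hypothetical filename; the real names are listed in the Zenodo record.
    print(describe_projection_file("umap_sbl_example.parquet"))
```

After loading the main Parquet file with `load_table`, comparing `df.shape` with `(29445, 82)`, the size stated in the description, is a quick way to confirm that the download is complete.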
26 January 2025