Main dataset for the Large-scale analysis of the β-lactamase sequence space with protein language models (Q6706224)

Dataset published in the Zenodo repository.

Language	Label	Description	Also known as
English	Main dataset for the Large-scale analysis of the β-lactamase sequence space with protein language models	Dataset published in the Zenodo repository.

Statements

instance of

data set

0 references

description

The main dataset for the publication "Large-scale analysis of the β-lactamase sequence space with protein language models". This dataset contains 29,445 rows and 82 columns and is provided in Parquet format. The rows represent all sequences retrieved from the BLDB. The columns contain information processed from the BLDB, including taxonomy annotated against the Genome Taxonomy Database (GTDB RS207), per-protein embeddings derived from five protein language models (ESM-1b, ESM2-650, ESM2-3b, CARP-640M, ProtTrans-t5-xl-u50), functional annotations estimated with Biopython, the sequence quality filters applied to select sequences for the analysis, annotations from the AlphaFold Database (AFDB) for the available structures, and secondary structure annotations generated with pyDSSP from the structures predicted by AlphaFold2. The 2-dimensional PCA, t-SNE, and UMAP representations for the evaluated protein language models are provided as separate datasets in CSV and Parquet formats. The algorithm used and the specific set of beta-lactamases are indicated at the beginning of each filename: sbl for serine beta-lactamases and mbl for metallo-beta-lactamases. For more information, consult the following GitHub repository: https://github.com/miangoar/Betalactamase-analysis-with-machine-learning

0 references
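
The description above gives the expected table shape (29,445 rows, 82 columns) and says that the projection files encode the algorithm and the beta-lactamase set at the start of the filename. A minimal Python sketch of how a user might check both is shown below. It rests on several assumptions that the record does not confirm: the exact file names are not listed here, so the example names are hypothetical; the filename tokens are assumed to be separated by non-alphanumeric characters; and the helper names `parse_projection_filename` and `load_main_dataset` are illustrative only. Loading Parquet with pandas additionally requires a Parquet engine such as pyarrow.

```python
import re
from pathlib import Path

# Shape stated in the dataset description: 29,445 rows and 82 columns.
EXPECTED_SHAPE = (29445, 82)

# Dimensionality-reduction algorithms named in the description.
ALGORITHMS = ("pca", "tsne", "umap")

# Filename codes for the two beta-lactamase sets named in the description.
FAMILIES = {
    "sbl": "serine beta-lactamases",
    "mbl": "metallo-beta-lactamases",
}


def parse_projection_filename(filename):
    """Return (algorithm, family_code) found in a projection filename.

    Assumption: tokens are separated by non-alphanumeric characters.
    Either element is None when no known token is present.
    """
    stem = Path(filename).stem.lower().replace("t-sne", "tsne")
    tokens = [t for t in re.split(r"[^a-z0-9]+", stem) if t]
    algorithm = next((t for t in tokens if t in ALGORITHMS), None)
    family = next((t for t in tokens if t in FAMILIES), None)
    return algorithm, family


def load_main_dataset(path):
    """Load the main Parquet file and check it against the stated shape.

    `path` is supplied by the caller; the real file name is not given
    in this record.
    """
    import pandas as pd  # imported lazily so the parser works without pandas

    df = pd.read_parquet(path)
    if df.shape != EXPECTED_SHAPE:
        raise ValueError(f"expected shape {EXPECTED_SHAPE}, got {df.shape}")
    return df


if __name__ == "__main__":
    # Hypothetical filenames, used only to illustrate the parsing.
    for name in ("umap_sbl_example.csv", "T-SNE_mbl_example.parquet"):
        algorithm, family = parse_projection_filename(name)
        print(name, algorithm, FAMILIES.get(family))
```

The shape check is deliberately strict: a mismatch usually means a different file or version was downloaded, which is worth catching before any analysis is run.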

publication date

26 January 2025

0 references

author

Miguel Ángel González Arias

0 references

Lorenzo Segovia

0 references

Alejandro Garciarrubio

0 references

copyright license