Enhanced Protein Isoform Characterization Through Long-Read Proteogenomics - Workflow Results




DOI: 10.5281/zenodo.5987905 | Zenodo: 5987905 | MaRDI QID: Q6724604

Dataset published at Zenodo repository.

Author name not available

Publication date: 30 January 2022




The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g., PacBio, Oxford Nanopore) provides full-length transcript sequencing, which can be used to predict full-length proteins. Here, we describe a long-read proteogenomics approach for integrating matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data in protein inference to enable detection of protein isoforms that are intractable to MS detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis.

Companion Repositories:
- Long-Read-Proteogenomics Workflow GitHub Repository Release
- Long-Read-Proteogenomics Analysis GitHub Repository Release

Companion Datasets:
- Long-Read-Proteogenomics Workflow Sample and Reference Data
- TEST Data for Long-Read-Proteogenomics Workflow GitHub Actions

This repository contains the complete output from the execution of the Long-Read-Proteogenomics Workflow, using the input from Jurkat Samples and Reference Data.

The file jurkat.flnc.bam was 6.5 GB and had to be split into 13 separate files; it should be rejoined before use. These are the steps that were used to split the file:

1. Convert jurkat.flnc.bam (binary format) to a SAM file (text format) without header: samtools view jurkat.flnc.bam > jurkat.flnc.sam
2. Capture the header: samtools view -H jurkat.flnc.bam > jurkat.flnc.header.sam
3. Split jurkat.flnc.sam into smaller files (aiming for a final size under 2 GB): split -l 400000 jurkat.flnc.sam jurkat.flnc.chunk.
4. Convert each of these files back to BAM for uploading: samtools view -b jurkat.flnc.chunk.a* -o jurkat.flnc.chunk.a*.bam (*=a,b,c,d,e,f,g,h,i,j,k,l,m)

After downloading, reverse this process, using the header file found in LRPG-Manuscript-Results-results-results-jurkat-isoseq3-companion-files.tar.gz:

1. Convert the BAM files back to SAM files: samtools view jurkat.flnc.chunk.a*.bam > jurkat.flnc.chunk.a*.sam (*=a,b,c,d,e,f,g,h,i,j,k,l,m)
2. Combine the SAM files (the header is reattached in step 4): cat jurkat.flnc.chunk.a*sam > jurkat.flnc.sam (the combined file was verified to have the same number of lines as the original without header: 4,956,761; the header file is 13 lines)
3. Convert to a BAM file if desired: samtools view -b jurkat.flnc.sam -o jurkat.flnc.bam
4. Rehead with the header file: samtools reheader -P -i jurkat.flnc.header.sam jurkat.flnc.bam
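The figures quoted in the description are mutually consistent: 4,956,761 lines split 400,000 at a time gives 13 chunks. The following is a minimal Python sketch of that arithmetic and of the line-count check, assuming split's default two-letter alphabetic suffixes (aa, ab, ...). The helper names split_chunk_names and count_lines are hypothetical and are not part of the released workflow.

```python
import math
from itertools import islice, product
from string import ascii_lowercase


def split_chunk_names(total_lines, lines_per_chunk, prefix):
    """Return the file names `split -l` is expected to produce.

    Assumes split's default two-letter suffixes, so this sketch only
    covers up to 26 * 26 = 676 chunks.
    """
    n_chunks = math.ceil(total_lines / lines_per_chunk)
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    return [prefix + suffix for suffix in islice(suffixes, n_chunks)]


def count_lines(paths):
    """Total number of lines across the given text files."""
    total = 0
    for path in paths:
        with open(path, "rb") as handle:
            total += sum(1 for _ in handle)
    return total


# Numbers taken from the description: 4,956,761 SAM lines, 400,000 per chunk.
names = split_chunk_names(4_956_761, 400_000, "jurkat.flnc.chunk.")
print(len(names), names[0], names[-1])
# prints: 13 jurkat.flnc.chunk.aa jurkat.flnc.chunk.am
```

After converting the downloaded chunks back to SAM, count_lines can be pointed at the chunk files (in any order) and the result compared with 4,956,761 before reattaching the header.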





