To map the human protein subset, or parts list, encoded by the genes on each chromosome.
- Starting Point: Missing Proteins (>30%) Mapping
- End Point: Complete list of all proteins coded by genes in all human chromosomes.
- The mapping will be complemented by studies of cellular and organ expression, subcellular distribution, functions, post-translational modifications (PTMs), and interactions under specific physiological settings, as well as alterations of expression and isoforms associated with pathophysiology.
The initial goal of C-HPP is to identify at least one representative protein encoded by each of the 21,823 human genes (Ensembl, Apr 2011) and to organize the data in accordance with the chromosomal structures. A second goal is to identify, for each protein, peptides that are observed in a wide range of samples and are suitable for the preparation of reference peptides. A third goal is to perform tissue localization and quantitative studies using mass spectrometric approaches, antibody capture reagents, or both. A fourth goal is to identify the set of isoforms encoded by each human gene, including those arising from alternative splicing and nsSNPs.
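As a rough illustration of organizing identifications by chromosome, the sketch below tallies how many genes on each chromosome have at least one identified protein. The function name, the input layout, and the gene names are hypothetical placeholders, not part of any C-HPP tool or dataset.

```python
from collections import defaultdict


def coverage_by_chromosome(gene_to_chromosome, identified_genes):
    """Return {chromosome: (genes_with_identified_protein, total_genes)}."""
    totals = defaultdict(int)
    found = defaultdict(int)
    for gene, chromosome in gene_to_chromosome.items():
        totals[chromosome] += 1
        if gene in identified_genes:
            found[chromosome] += 1
    return {chrom: (found[chrom], totals[chrom]) for chrom in totals}


# Hypothetical toy data: gene -> chromosome, plus the genes with protein evidence.
genes = {"GENE_A": "1", "GENE_B": "1", "GENE_C": "X"}
identified = {"GENE_A", "GENE_C"}

coverage = coverage_by_chromosome(genes, identified)
for chrom, (n_found, n_total) in sorted(coverage.items()):
    print(f"chr{chrom}: {n_found}/{n_total} genes with an identified protein")
```

A real tally would take the gene-to-chromosome assignment from a genome annotation release rather than from a hand-written dictionary.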
A typical experimental workflow of missing protein identification by C-HPP teams
Stage 1. Experimental procedures for data production (Steps 1-5)
- Step 1 is to compile a list of “missing proteins” by cross-checking several databases (e.g., UniProt, Ensembl, GPMDB) against the entire list of protein-coding genes. At the same time, a separate effort will be made to improve the quality of mass spectrometric identifications in all but the highest-probability category.
- Step 2 will be to obtain specific mRNA expression patterns by RNA-seq and reverse transcription-polymerase chain reaction, based on public databases (GeneCards [www.genecards.org] and dbEST [www.ncbi.nlm.nih.gov/dbEST]) with defined expression thresholds. For these transcriptomic analyses, we will work with a group of cell biologists who maintain a variety of specific cell lines and stem cells, which may yield unusual, rarely expressed proteins that are hard to detect under normal culture conditions.
- Step 3 will be to characterize at least one representative isoform and three major post-translational modifications (PTMs) (i.e., phosphorylation, glycosylation, and acetylation) for each protein.
- Step 4 will be to explore the annotation and disease-related context of newly identified proteins in collaboration with the B/D project.
- Step 5 will be to perform the validation work, including proteomic profiling with re-identification, quantification, and cross-validation (SRM and antibody capture methods).
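The cross-checking in Step 1 and the threshold filter in Step 2 can be sketched as set operations. This is a minimal illustration under stated assumptions: the function names, the input layout, the gene and sample names, and the threshold value are all hypothetical, and no real database API is called.

```python
def find_missing_proteins(coding_genes, evidence_by_db):
    """Return genes with no protein-level evidence in any of the databases."""
    observed = set().union(*evidence_by_db.values())
    return sorted(set(coding_genes) - observed)


def candidate_samples(missing, expression, threshold):
    """For each missing protein, list samples whose mRNA level meets the threshold."""
    candidates = {}
    for gene in missing:
        samples = sorted(
            sample
            for sample, level in expression.get(gene, {}).items()
            if level >= threshold
        )
        if samples:
            candidates[gene] = samples
    return candidates


# Hypothetical toy inputs standing in for database exports.
coding_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
evidence_by_db = {
    "UniProt": {"GENE_A"},
    "GPMDB": {"GENE_A", "GENE_C"},
}
expression = {
    "GENE_B": {"cell_line_1": 0.2, "stem_cell_1": 5.0},
    "GENE_D": {"cell_line_1": 0.1},
}

missing = find_missing_proteins(coding_genes, evidence_by_db)
candidates = candidate_samples(missing, expression, threshold=1.0)
print(missing)
print(candidates)
```

Genes that stay missing and have no sample above the threshold, such as `GENE_D` here, are the ones for which unusual cell lines or stem cells would be most worth trying.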
Stage 2. Quality control of data and submission procedures (Steps 6-8)
- Step 6 will be to normalize the dataset and select the identifications with the highest degree of confidence. Given the presence of two types of peptides, proteotypic and ambiguously mapped (www.proteomecenter.org), a more sensitive, unambiguous, and quantitative assay should be developed for each protein. We will use SRM as the main engine for quantitation, since the SRMAtlas is already available and provides suggested transitions and collision energy settings, observed retention times, calculated hydrophobicity, and information concerning peptide fragments.
- Step 7 will be to set the standard operating procedures for data handling. As we learned from the exploratory phase of the Human Plasma Proteome Project, characterizing proteins consistently by MS methods requires that we put significant effort into quality control and ensure commitment to deep analysis of specimens.
- Step 8 will be to build the C-HPP databases and use the data for biology and disease research in collaboration with the B/D-HPP group. Findings and results generated by an individual group will belong to the corresponding investigators for publication. However, the results from each group should be published or deposited in the central C-HPP databases. All PIs are also encouraged to give the B/D and C-HPP teams the opportunity to contribute to, and even re-analyze, their data.
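The distinction drawn in Step 6 between proteotypic and ambiguously mapped peptides can be illustrated with a small sketch. It uses a simplified assumption: a peptide is treated as proteotypic when it maps to exactly one protein in the supplied table, and detectability by MS is ignored. The function name, protein names, and peptide sequences are hypothetical.

```python
from collections import defaultdict


def classify_peptides(protein_to_peptides):
    """Split peptides into uniquely mapped and ambiguously mapped groups."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    unique = {}
    ambiguous = {}
    for peptide, proteins in peptide_to_proteins.items():
        if len(proteins) == 1:
            unique[peptide] = next(iter(proteins))
        else:
            ambiguous[peptide] = sorted(proteins)
    return unique, ambiguous


# Hypothetical toy table: protein -> peptides from an in-silico digest.
digest = {
    "PROT_1": ["AAAK", "CCCR"],
    "PROT_2": ["CCCR", "DDDK"],
}

unique, ambiguous = classify_peptides(digest)
print(unique)
print(ambiguous)
```

Only the uniquely mapped peptides would be candidates for an unambiguous SRM assay; the shared ones cannot distinguish between the proteins that contain them.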