Articles Online (Volume 11, Issue 2)

Original Research

Interpretation, Stratification and Evidence for Sequence Variants Affecting mRNA Splicing in Complete Human Genome Sequences

Ben C. Shirley, Eliseos J. Mucaki, Tyson Whitehead, Paul I. Costea, Pelin Akan, Peter K. Rogan

Information theory-based methods have been shown to be sensitive and specific for predicting and quantifying the effects of non-coding mutations in Mendelian diseases. We present the Shannon pipeline software for genome-scale mutation analysis and provide evidence that the software predicts variants affecting mRNA splicing. Individual information contents (in bits) of reference and variant splice sites are compared and significant differences are annotated and prioritized. The software has been implemented for CLC-Bio Genomics platform. Annotation indicates the context of novel mutations as well as common and rare SNPs with splicing effects. Potential natural and cryptic mRNA splicing variants are identified, and null mutations are distinguished from leaky mutations. Mutations and rare SNPs were predicted in genomes of three cancer cell lines (U2OS, U251 and A431), which were supported by expression analyses. After filtering, tractable numbers of potentially deleterious variants are predicted by the software, suitable for further laboratory investigation. In these cell lines, novel functional variants comprised 6–17 inactivating mutations, 1–5 leaky mutations and 6–13 cryptic splicing mutations. Predicted effects were validated by RNA-seq analysis of the three aforementioned cancer cell lines, and expression microarray analysis of SNPs in HapMap cell lines.

Page 77–85

Original Research

Basophile: Accurate Fragment Charge State Prediction Improves Peptide Identification Rates

Dong Wang, Surendra Dasari, Matthew C. Chambers, Jerry D. Holman, Kan Chen, Daniel C. Liebler, Daniel J. Orton, Samuel O. Purvine, Matthew E. Monroe, Chang Y. Chung, Kristie L. Rose, David L. Tabb

In shotgun proteomics, database search algorithms rely on fragmentation models to predict fragment ions that should be observed for a given peptide sequence. The most widely used strategy (Naive model) is oversimplified, cleaving all peptide bonds with equal probability to produce fragments of all charges below that of the precursor ion. More accurate models, based on fragmentation simulation, are too computationally intensive for on-the-fly use in database search algorithms. We have created an ordinal-regression-based model called Basophile that takes fragment size and basic residue distribution into account when determining the charge retention during CID/higher-energy collision induced dissociation (HCD) of charged peptides. This model improves the accuracy of predictions by reducing the number of unnecessary fragments that are routinely predicted for highly-charged precursors. Basophile increased the identification rates by 26% (on average) over the Naive model, when analyzing triply-charged precursors from ion trap data. Basophile achieves simplicity and speed by solving the prediction problem with an ordinal regression equation, which can be incorporated into any database search software for shotgun proteomic identification.

Page 86–95

Original Research

Structure-based Comparative Analysis and Prediction of N-linked Glycosylation Sites in Evolutionarily Distant Eukaryotes

Phuc Vinh Nguyen Lam, Radoslav Goldman, Konstantinos Karagiannis, Tejas Narsule, Vahan Simonyan, Valerii Soika, Raja Mazumder

The asparagine-X-serine/threonine (NXS/T) motif, where X is any amino acid except proline, is the consensus motif for N-linked glycosylation. Significant numbers of high-resolution crystal structures of glycosylated proteins allow us to carry out structural analysis of the N-linked glycosylation sites (NGS). Our analysis shows that there is enough structural information from diverse glycoproteins to allow the development of rules which can be used to predict NGS. A Python-based tool was developed to investigate asparagines implicated in N-glycosylation in five species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana and Saccharomyces cerevisiae. Our analysis shows that 78% of all asparagines of NXS/T motif involved in N-glycosylation are localized in the loop/turn conformation in the human proteome. Similar distribution was revealed for all the other species examined. Comparative analysis of the occurrence of NXS/T motifs not known to be glycosylated and their reverse sequence (S/TXN) shows a similar distribution across the secondary structural elements, indicating that the NXS/T motif in itself is not biologically relevant. Based on our analysis, we have defined rules to determine NGS. Using machine learning methods based on these rules we can predict with 93% accuracy if a particular site will be glycosylated. If structural information is not available the tool uses structural prediction results resulting in 74% accuracy. The tool was used to identify glycosylation sites in 108 human proteins with structures and 2247 proteins without structures that have acquired NXS/T site/s due to non-synonymous variation. The tool, Structure Feature Analysis Tool (SFAT), is freely available to the public at

Page 96–104

Original Research

Mass Spectrometry-based Quantitative Proteomic Profiling of Human Pancreatic and Hepatic Stellate Cell Lines

Joao A. Paulo , Vivek Kadiyala, Peter A. Banks, Darwin L. Conwell, Hanno Steen

The functions of the liver and the pancreas differ; however, chronic inflammation in both organs is associated with fibrosis. Evidence suggests that fibrosis in both organs is partially regulated by organ-specific stellate cells. We explore the proteome of human hepatic stellate cells (hHSC) and human pancreatic stellate cells (hPaSC) using mass spectrometry (MS)-based quantitative proteomics to investigate pathophysiologic mechanisms. Proteins were isolated from whole cell lysates of immortalized hHSC and hPaSC. These proteins were tryptically digested, labeled with tandem mass tags (TMT), fractionated by OFFGEL, and subjected to MS. Proteins significantly different in abundance (P < 0.05) were classified via gene ontology (GO) analysis. We identified 1223 proteins and among them, 1222 proteins were quantifiable. Statistical analysis determined that 177 proteins were of higher abundance in hHSC, while 157 were of higher abundance in hPaSC. GO classification revealed that proteins of relatively higher abundance in hHSC were associated with protein production, while those of relatively higher abundance in hPaSC were involved in cell structure. Future studies using the methodologies established herein, but with further upstream fractionation and/or use of enhanced MS instrumentation will allow greater proteome coverage, achieving a comprehensive proteomic analysis of hHSC and hPaSC.

Page 105–113

Original Research

Search and Analysis of Identical Reverse Octapeptides in Unrelated Proteins

Konda Mani Saravanan, Samuel Selvaraj

For the past few decades, intensive studies have been carried out in an attempt to understand how the amino acid sequences of proteins encode their three dimensional structures to perform their specific functions. In order to understand the sequence-structure relationship of proteins, several sub-sequence search studies in non-redundant sequence-structure databases have been undertaken which have given some fruitful clues. In our earlier work, we analyzed a set of 3124 non-redundant protein sequences from the Protein Data Bank (PDB) and retrieved 30 identical octapeptides having different secondary structures. These octapeptides were characterized by using different computational procedures. This prompted us to explore the presence of octapeptides with reverse sequences and to analyze whether these octapeptides would adopt similar structures as that of their parent octapeptides. Our identical reverse octapeptide search resulted in the finding of eight octapeptide pairs (octapeptide and reverse octapeptide) with similar secondary structure and 23 octapeptide pairs with different secondary structures. In the present work, the geometrical and biophysical characteristics of identical reverse octapeptides were explored and compared with unrelated octapeptide pairs by using various computational tools. We thus conclude that proteins containing identical reverse octapeptides are not very abundant and residues in the octapeptide pairs do not contribute to the stability of the protein. Furthermore, compared to unrelated octapeptides, identical reverse octapeptides do not show certain biophysical and geometrical properties.

Page 114–121

Application Note

SNVDis: A Proteome-wide Analysis Service for Evaluating nsSNVs in Protein Functional Sites and Pathways

Konstantinos Karagiannis, Vahan Simonyan, Raja Mazumder

Amino acid changes due to non-synonymous variation are included as annotations for individual proteins in UniProtKB/Swiss-Prot and RefSeq which present biological data in a protein- or gene-centric fashion. Unfortunately, proteome-wide analysis of non-synonymous single-nucleotide variations (nsSNVs) is not easy to perform because information on nsSNVs and functionally important sites are not well integrated both within and between databases and their search engines. We have developed SNVDis that allows evaluation of proteome-wide nsSNV distribution in functional sites, domains and pathways. More specifically, we have integrated human-specific data from major variation databases (UniProtKB, dbSNP and COSMIC), comprehensive sequence feature annotation from UniProtKB, Pfam, RefSeq, Conserved Domain Database (CDD) and pathway information from Protein ANalysis THrough Evolutionary Relationships (PANTHER) and mapped all of them in a uniform and comprehensive way to the human reference proteome provided by UniProtKB/Swiss-Prot. Integrated information of active sites, pathways, binding sites, domains, which are extracted from a number of different sources, provides a detailed overview of how nsSNVs are distributed over the human proteome and pathways and how they intersect with functional sites of proteins. Additionally, it is possible to find out whether there is an over- or under-representation of nsSNVs in specific domains, pathways or user-defined protein lists. The underlying datasets are updated once every 3 months. SNVDis is freely available at

Page 122–126

Application Note

pepgrep: A Tool for Peptide MS/MS Pattern Matching

Igor Chernukhin

Typically, detection of protein sequences in collision-induced dissociation (CID) tandem MS (MS2) dataset is performed by mapping identified peptide ions back to protein sequence by using the protein database search (PDS) engine. Finding a particular peptide sequence of interest in CID MS2 records very often requires manual evaluation of the spectrum, regardless of whether the peptide-associated MS2 scan is identified by PDS algorithm or not. We have developed a compact cross-platform database-free command-line utility, pepgrep, which helps to find an MS2 fingerprint for a selected peptide sequence by pattern-matching of modelled MS2 data using Peptide-to-MS2 scoring algorithm. pepgrep can incorporate dozens of mass offsets corresponding to a variety of post-translational modifications (PTMs) into the algorithm. Decoy peptide sequences are used with the tested peptide sequence to reduce false-positive results. The engine is capable of screening an MS2 data file at a high rate when using a cluster computing environment. The matched MS2 spectrum can be displayed by using built-in graphical application programming interface (API) or optionally recorded to file. Using this algorithm, we were able to find extra peptide sequences in studied CID spectra that were missed by PDS identification. Also we found pepgrep especially useful for examining a CID of small fractions of peptides resulting from, for example, affinity purification techniques. The peptide sequences in such samples are less likely to be positively identified by using routine protein-centric algorithm implemented in PDS. The software is freely available at

Page 127–132