Articles Online (Volume 3, Issue 4)

Editorial

Science and Scientificity

Zongliang Xu

We are now living in a scientific era, in which the theory and practice of science have penetrated all aspects of society and science is often a hot topic. However, what exactly is science? This question is largely neglected by many people; even researchers focusing on scientific studies may not have a very clear understanding of it.

Page 197-200


Research Article

Classifying Genomic Sequences by Sequence Feature Analysis

Zhihua Liu, Dian Jiao, Xiao Sun

Traditional sequence analysis depends on sequence alignment. In this study, we analyzed various functional regions of the human genome based on sequence features, including word frequency, dinucleotide relative abundance, and base-base correlation. We analyzed human chromosome 22 and classified the upstream, exon, intron, downstream, and intergenic regions by principal component analysis and discriminant analysis of these features. The results show that the functional regions of the genome can be classified based on sequence features and discriminant analysis.
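
As an illustration of one of the features named above, the following is a minimal sketch of dinucleotide relative abundance, assuming the common definition rho_xy = f_xy / (f_x * f_y), where f_x is the frequency of base x and f_xy is the frequency of the dinucleotide xy. The function name and the handling of missing bases are assumptions made for this example; the abstract does not state the exact formulation the authors used.

```python
from collections import Counter


def dinucleotide_relative_abundance(seq):
    """Return rho_xy = f_xy / (f_x * f_y) for all 16 dinucleotides.

    Assumed definition; bases absent from the sequence yield NaN.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    base_counts = Counter(seq)
    # Overlapping dinucleotides: positions 0..n-2
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1
    rho = {}
    for x in "ACGT":
        for y in "ACGT":
            fx = base_counts[x] / n
            fy = base_counts[y] / n
            fxy = pair_counts[x + y] / total_pairs
            rho[x + y] = fxy / (fx * fy) if fx > 0 and fy > 0 else float("nan")
    return rho
```

A value above 1 indicates that a dinucleotide occurs more often than expected from the base frequencies alone, and a value below 1 indicates that it occurs less often. The 16 values can serve as one block of a feature vector for downstream analysis.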

Page 201-205


Research Article

LZ Complexity Distance of DNA Sequences and Its Application in Phylogenetic Tree Reconstruction

Bin Li, Yibing Li, Hongbo He

DNA sequences can be treated as finite-length symbol strings over a four-letter alphabet (A, C, T, G). As a universal and computable complexity measure, LZ complexity is a valid way to describe the complexity of DNA sequences. In this study, a concept of conditional LZ complexity between two sequences is proposed according to the principle of the LZ complexity measure. An LZ complexity distance metric between two nonnull sequences is defined by utilizing conditional LZ complexity. Based on LZ complexity distance, a phylogenetic tree of 26 species of placental mammals (Eutheria) with three outgroup species was reconstructed from their complete mitochondrial genomes. In the debate over which two of the three main groups of placental mammals, namely Primates, Ferungulates, and Rodents, are more closely related, the phylogenetic tree reconstructed based on LZ complexity distance supports the suggestion that Primates and Ferungulates are more closely related.
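
To make the idea concrete, the following is a minimal sketch of LZ complexity computed as the number of phrases in the Lempel-Ziv (1976) exhaustive parsing of a string. The function `lz_distance_sketch` is a hypothetical, concatenation-based distance included only for illustration; it is an assumption of this example and is not necessarily the conditional LZ complexity or the distance metric defined in the article.

```python
def lz_complexity(s):
    """Count phrases in the Lempel-Ziv (1976) exhaustive parsing of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from the text
        # that precedes its last symbol (overlap is allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance_sketch(a, b):
    """Illustrative distance (assumed form, not the article's definition)."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    ca, cb = lz_complexity(a), lz_complexity(b)
    # Extra phrases needed to describe one sequence after seeing the other.
    extra = max(lz_complexity(a + b) - ca, lz_complexity(b + a) - cb)
    return extra / max(ca, cb)
```

This quadratic-time version favors clarity over speed; complete mitochondrial genomes would call for a more efficient parser.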

Page 206-212


Research Article

Characterizing the Microenvironment Surrounding Phosphorylated Protein Sites

Shicai Fan, Xuegong Zhang

Protein phosphorylation plays an important role in various cellular processes. Because of its high complexity, the mechanism needs further study. In the last few years, many methods have been proposed in this field, but almost all of them investigate the mechanism based on the protein sequences around phosphorylation sites. In this study, we characterize the microenvironment surrounding phosphorylated protein sites with a modified shell model, and identify some significant properties by the rank-sum test, such as the lack of some classes of residues, atoms, and secondary structures. Furthermore, we find that the depletion of some properties affects protein phosphorylation remarkably. Our results suggest that exploring the mechanism of protein phosphorylation from the microenvironment is a meaningful direction, and we expect further findings as the amount of phosphorylation and protein structure data increases.

Page 213-217


Research Article

A Contact Energy Function Considering Residue Hydrophobic Environment and Its Application in Protein Fold Recognition

Mojie Duan, Yanhong Zhou

The three-dimensional (3D) structure prediction of proteins is an important task in bioinformatics. Finding energy functions that can better represent residue-residue and residue-solvent interactions is a crucial way to improve the prediction accuracy. The widely used contact energy functions mostly consider only the contact frequency between different types of residues; however, we find that the contact frequency also relates to the residue hydrophobic environment. Accordingly, we present an improved contact energy function that integrates the two factors and can reflect the influence of hydrophobic interaction on the stabilization of protein 3D structure more effectively. Furthermore, a fold recognition (threading) approach based on this energy function is developed. The testing results obtained with 20 randomly selected proteins demonstrate that, compared with common contact energy functions, the proposed energy function can improve the accuracy of the fold template prediction from 20% to 50%, and can also improve the accuracy of the sequence-template alignment from 35% to 65%.

Page 218-224


Research Article

A Branch and Bound Algorithm for the Protein Folding Problem in the HP Lattice Model

Mao Chen, Wenqi Huang

A branch and bound algorithm is proposed for the two-dimensional protein folding problem in the HP lattice model. In this algorithm, the benefit of each possible location of hydrophobic monomers is evaluated and only promising nodes are kept for further branching at each level. The proposed algorithm is compared with other well-known methods for 10 benchmark sequences with lengths ranging from 20 to 100 monomers. The results indicate that our method is a very efficient and promising tool for the protein folding problem.
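
For readers unfamiliar with the HP lattice model, the following is a minimal sketch of its standard objective on a two-dimensional square lattice: the energy of a conformation is minus the number of pairs of hydrophobic (H) monomers that are lattice neighbors but not consecutive in the chain. The function name and input format are assumptions of this example, and the sketch covers only the energy evaluation, not the branch and bound search described in the article.

```python
def hp_energy(sequence, coords):
    """Energy of a 2D HP conformation: -1 per non-consecutive H-H contact.

    sequence: string over {"H", "P"}; coords: list of (x, y) lattice points,
    one per monomer, in chain order.
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive monomers must be lattice neighbors")

    index_at = {pos: i for i, pos in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts
```

For example, the chain HPPH folded around a unit square brings its two H monomers into contact, while the same chain laid out in a straight line has no contacts.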

Page 225-230


Research Article

Preprocessing of Tandem Mass Spectrometric Data Based on Decision Tree Classification

Jinfen Zhang, Simin He, Jinjin Cai, Xingjun Cao, Ruixiang Sun, Yan Fu, Rong Zeng, Wen Gao

In this study, we present a preprocessing method for quadrupole time-of-flight (Q-TOF) tandem mass spectra to increase the accuracy of database searching for peptide (protein) identification. Based on the natural isotopic information inherent in tandem mass spectra, we construct a decision tree after feature selection to classify the noise and ion peaks in tandem spectra. Furthermore, we recognize overlapping peaks to find the monoisotopic masses of ions for the following identification process. The experimental results show that this preprocessing method increases the search speed and the reliability of peptide identification.

Page 231-237


Research Article

Constructing Support Vector Machine Ensembles for Cancer Classification Based on Proteomic Profiling

Yong Mao, Xiaobo Zhou, Daoying Pi, Youxian Sun

In this study, we present a constructive algorithm for training cooperative support vector machine ensembles (CSVMEs). CSVME combines ensemble architecture design with cooperative training for individual SVMs in ensembles. Unlike most previous studies on training ensembles, CSVME puts emphasis on both accuracy and collaboration among individual SVMs in an ensemble. A group of SVMs selected on the basis of recursive classifier elimination is used in CSVME, and the number of the individual SVMs selected to construct CSVME is determined by 10-fold cross-validation. This kind of SVME has been tested on two ovarian cancer datasets previously obtained by proteomic mass spectrometry. By combining several individual SVMs, the proposed method achieves better performance than the SVME of all base SVMs.

Page 238-241


Research Article

Prediction and Classification of Human G-protein Coupled Receptors Based on Support Vector Machines

Yunfei Wang, Huan Chen, Yanhong Zhou

A computational system for the prediction and classification of human G-protein coupled receptors (GPCRs) has been developed based on the support vector machine (SVM) method and protein sequence information. The feature vectors used to develop the SVM prediction models consist of statistically significant features selected from single amino acid, dipeptide, and tripeptide compositions of protein sequences. Furthermore, the length distribution difference between GPCRs and non-GPCRs has also been exploited to improve the prediction performance. The testing results with annotated human protein sequences demonstrate that this system performs well in both the prediction and the classification of human GPCRs.
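
The composition features mentioned above can be sketched as follows, assuming that a k-mer composition is the fraction of overlapping k-residue windows equal to each possible k-mer over the 20 standard amino acids (k = 1, 2, 3 for single amino acid, dipeptide, and tripeptide compositions). The function name and the choice to skip windows containing non-standard residues are assumptions of this example; the statistical feature selection and the SVM training described in the abstract are not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_features(protein, k):
    """Return the k-mer composition of a protein as a list of 20**k fractions.

    Windows containing non-standard residues are skipped (an assumption).
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(protein) - k + 1):
        kmer = protein[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]
```

Concatenating the outputs for k = 1, 2, and 3 gives a raw vector of 20 + 400 + 8000 entries, from which a subset of informative features could then be selected.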

Page 242-246


Research Article

Predicting the Coupling Specificity of G-protein Coupled Receptors to G-proteins by Support Vector Machines

Cuiping Guan, Zhenran Jiang, Yanhong Zhou

G-protein coupled receptors (GPCRs) represent one of the most important classes of drug targets for the pharmaceutical industry and play important roles in cellular signal transduction. Predicting the coupling specificity of GPCRs to G-proteins is vital for further understanding the mechanism of signal transduction and the function of the receptors within a cell, which can provide new clues for pharmaceutical research and development. In this study, the features of amino acid compositions and physicochemical properties of the full-length GPCR sequences have been analyzed and extracted. Based on these features, classifiers have been developed to predict the coupling specificity of GPCRs to G-proteins using support vector machines. The testing results show that this method achieves better prediction accuracy.

Page 247-251


Research Article

Identifying G-protein Coupled Receptors Using Weighted Levenshtein Distance and Nearest Neighbor Method

Jianhua Xu

G-protein coupled receptors (GPCRs) are a class of seven-helix transmembrane proteins that have been used in bioinformatics as targets to facilitate drug discovery for human diseases. Although thousands of GPCR sequences have been collected, the ligand specificity of many GPCRs is still unknown and only one crystal structure of the rhodopsin-like family has been solved. Therefore, identifying GPCR types from sequence data alone has become an important research issue. In this study, a novel technique for identifying GPCR types based on the weighted Levenshtein distance between two receptor sequences and the nearest neighbor method (NNM) is introduced, which can deal directly with receptor sequences of different lengths. In our experiments for classifying four classes (acetylcholine, adrenoceptor, dopamine, and serotonin) of the rhodopsin-like family of GPCRs, the error rates from the leave-one-out procedure and the leave-half-out procedure were 0.62% and 1.24%, respectively. These results are superior to those of the covariant discriminant algorithm, the support vector machine method, and the NNM with Euclidean distance.
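
The general approach can be sketched as a weighted edit distance combined with a one-nearest-neighbor rule. In the sketch below, the insertion, deletion, and substitution costs are placeholder values and the function names are hypothetical; the abstract does not give the weights used in the article, so these are assumptions for illustration only.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=None):
    """Weighted edit distance between sequences a and b (dynamic programming).

    sub_cost(x, y) returns the cost of substituting x by y; the default
    (0 for a match, 1 otherwise) is a placeholder, not the article's weights.
    """
    if sub_cost is None:
        def sub_cost(x, y):
            return 0.0 if x == y else 1.0
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,               # delete ca
                curr[j - 1] + ins_cost,           # insert cb
                prev[j - 1] + sub_cost(ca, cb),   # substitute or match
            ))
        prev = curr
    return prev[-1]


def nearest_neighbor_label(query, labeled_sequences):
    """Return the label of the training sequence closest to query.

    labeled_sequences: iterable of (sequence, label) pairs.
    """
    _, label = min(
        labeled_sequences,
        key=lambda item: weighted_levenshtein(query, item[0]),
    )
    return label
```

Because the distance is defined on whole strings, sequences of different lengths can be compared without first converting them to fixed-length feature vectors.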

Page 252-257