Articles Online (Volume 4, Issue 2)


Characterization of Binding Sites of Eukaryotic Transcription Factors

Jiang Qian, Jimmy Lin, Donald J. Zack

To explore the nature of eukaryotic transcription factor (TF) binding sites and determine how they differ from surrounding DNA sequences, we examined four features associated with DNA binding sites: G+C content, pattern complexity, palindromic structure, and Markov sequence ordering. Our analysis of the regulatory motifs obtained from the TRANSFAC database, using yeast intergenic sequences as background, revealed that these four features show variable enrichment in motif sequences. For example, motif sequences were more likely to have palindromic structure than were background sequences. In addition, these features were tightly localized to the regulatory motifs, indicating that they are a property of the motif sequences themselves and are not shared by the general promoter “environment” in which the regulatory motifs reside. By breaking down the motif sequences according to the TF classes to which they bind, more specific associations were identified. Finally, we found that some correlations, such as G+C content enrichment, were species-specific, while others, such as complexity enrichment, were universal across the species examined. The quantitative analysis provided here should increase our understanding of protein-DNA interactions and also help facilitate the discovery of regulatory motifs through bioinformatics.

Page 67–79
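Two of the four features named in the abstract above, G+C content and palindromic structure, are simple to compute for a candidate site. The following is a minimal sketch, not the authors' method: the abstract does not define its palindrome measure, so this assumes the common definition that a site is palindromic when it equals its own reverse complement. The function names are hypothetical.

```python
# Assumed definitions for illustration; the paper's exact measures may differ.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """Return True if seq reads the same as its reverse complement."""
    return seq.upper() == reverse_complement(seq)


if __name__ == "__main__":
    site = "GAATTC"  # example site chosen for illustration only
    print(gc_content(site), is_palindromic(site))
```

Comparing these values between motif sequences and a background set (the abstract uses yeast intergenic sequences) would be one way to quantify enrichment.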


Topological Properties of Protein-Protein and Metabolic Interaction Networks of Drosophila melanogaster

Thanigaimani Rajarathinam, Yen-Han Lin

The principles underlying the natural phenomena of life have received considerable attention in recent years. A key feature of the scale-free architecture is the vital role of the most connected nodes (hubs). The major objective of this article was to analyze the protein-protein and metabolic interaction networks of Drosophila melanogaster by considering their architectural patterns and the consequences of hub removal for the topological parameters of the two interaction systems. Analysis showed that both interaction networks follow a scale-free model, consistent with the observation that most real-world networks, across varied settings, conform to the small-world pattern. Deleting the hubs increased the average path length about two-fold for the protein-protein interaction network (from 9.42 to 20.93) and about three-fold for the metabolic interaction network (from 5.29 to 17.75). In contrast, the random elimination of nodes produced no notable change in the average path length of the protein-protein and metabolic interaction networks (9.42±0.02 and 5.27±0.01, respectively). The contrast between the two cases underscores the significance of the most linked nodes to the natural topology of the networks.

Page 80–89
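The hub-deletion comparison described in the abstract above can be illustrated on a toy graph. This is a sketch under stated assumptions, not the authors' procedure: the seven-node network, the function names, and the choice to average only over reachable pairs are all invented for illustration.

```python
from collections import deque


def average_path_length(adj):
    """Mean shortest-path length over all reachable pairs of distinct nodes."""
    total = 0
    pairs = 0
    for source in adj:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def remove_nodes(adj, doomed):
    """Return a copy of the graph without the given nodes and their edges."""
    doomed = set(doomed)
    return {n: {m for m in nbrs if m not in doomed}
            for n, nbrs in adj.items() if n not in doomed}


def top_hubs(adj, k):
    """Return the k nodes with the highest degree."""
    return sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:k]


def build_toy_network():
    """Hypothetical example: a six-node ring plus one hub 'H' linked to all."""
    adj = {n: set() for n in range(6)}
    adj["H"] = set()
    for n in range(6):
        m = (n + 1) % 6
        adj[n].add(m)
        adj[m].add(n)
        adj[n].add("H")
        adj["H"].add(n)
    return adj


if __name__ == "__main__":
    net = build_toy_network()
    print(average_path_length(net))
    print(average_path_length(remove_nodes(net, top_hubs(net, 1))))
    print(average_path_length(remove_nodes(net, [0])))
```

In this toy graph, deleting the hub raises the average path length from about 1.43 to 1.8, while deleting a single ring node leaves it at 1.4, which mirrors in miniature the contrast the abstract reports.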


Normalization Using Weighted Negative Second Order Exponential Error Functions (NeONORM) Provides Robustness Against Asymmetries in Comparative Transcriptome Profiles and Avoids False Calls

Sebastian Noth, Guillaume Brysbaert, Arndt Benecke

Studies on high-throughput global gene expression using microarray technology have generated ever larger amounts of systematic transcriptome data. A major challenge in exploiting these heterogeneous datasets is how to normalize the expression profiles by inter-assay methods. Different non-linear and linear normalization methods have been developed, which essentially rely on the hypothesis that the true or perceived logarithmic fold-change distributions between two different assays are symmetric in nature. However, asymmetric gene expression changes are frequently observed, leading to suboptimal normalization results and in consequence potentially to thousands of false calls. Therefore, we have specifically investigated asymmetric comparative transcriptome profiles and developed the normalization using weighted negative second order exponential error functions (NeONORM) for robust and global inter-assay normalization. NeONORM efficiently damps true gene regulatory events in order to minimize their misleading impact on the normalization process. We evaluated NeONORM's applicability using artificial and true experimental datasets, both of which demonstrated that NeONORM could be systematically applied to inter-assay and inter-condition comparisons.

Page 90–109


Improve Survival Prediction Using Principal Components of Gene Expression Data

Yi-Jing Shen, Shu-Guang Huang

The purpose of many microarray studies is to find the association between gene expression and sample characteristics such as treatment type or sample phenotype. There has been a surge of effort to develop methods for delineating this association. Aside from the high dimensionality of microarray data, one well-recognized challenge is that genes can be inter-related in complex ways, which makes many statistical methods inappropriate to apply directly to the expression data. Multivariate methods such as principal component analysis (PCA) and clustering are often used to capture the gene correlation, and the derived components or clusters are then used to describe the association between gene expression and sample phenotype. We propose a method for patient population dichotomization using maximally selected test statistics in combination with the PCA method, which shows favorable results. The proposed method is compared with a currently well-recognized method.

Page 110–119


Predicting the Subcellular Localization of Human Proteins Using Machine Learning and Exploratory Data Analysis

George K. Acquaah-Mensah, Sonia M. Leach, Chittibabu Guda

Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Naïve Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.

Page 120–133
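The abstract above mentions feature vectors built from amino acid sequence characteristics without listing them. As one plausible illustration only, the sketch below computes a 20-dimensional amino acid composition vector, a common sequence feature for classifiers; it is an assumption that features of this kind were used, and the names are hypothetical.

```python
# The 20 standard amino acids in one-letter code, in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq: str) -> list:
    """Return the fraction of each standard amino acid in seq.

    Non-standard or ambiguous residue codes (for example X) are ignored.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    n_counted = 0
    for residue in seq.upper():
        if residue in counts:
            counts[residue] += 1
            n_counted += 1
    if n_counted == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / n_counted for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Short made-up peptide, for illustration only.
    print(composition_vector("MKTAYIAKQR"))
```

A vector like this, computed per protein, could serve as input to a decision tree or any of the other classifiers the abstract compares.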


PHProteomicDB: A Module for Two-dimensional Gel Electrophoresis Database Creation on Personal Web Sites

Pascal Pernet, Arnaud Bruneel, Bruno Baudin, Michel Vaubourdolle

PHProteomicDB is a PHP-written module to help researchers in proteomics to share two-dimensional gel electrophoresis data using personal web sites. No technical or PHP knowledge is necessary except a few basics about web site management. PHProteomicDB has a user-friendly administration interface to enter and update data. It creates web pages on the fly displaying gel characteristics, gel pictures, and numbered gel spots with their related identifications pointing to their reference pages in protein databanks. The module is freely available at

Page 134–136