Articles Online (Volume 3, Issue 2)


Guest Editor’s Foreword

Louxin Zhang

This special issue is devoted to bioinformatics research in Singapore. Bioinformatics research in Singapore began largely in 1996, when the Bioinformatics Center of the National University of Singapore was formed. As part of the government’s efforts to turn Singapore into a powerhouse of biomedical research, the Genome Institute of Singapore and the Bioinformatics Institute of Singapore have been established since 2000. Recently, a bioinformatics research center was also formed at Nanyang Technological University. Currently, there are a large number of bioinformatics research teams in each of these institutions.

Page 61

Review Article

MicroRNA: Biological and Computational Perspective

Yong Kong, Jin-Hua Han

MicroRNAs (miRNAs) are endogenously expressed non-coding RNAs of 20–24 nucleotides that post-transcriptionally regulate gene expression in plants and animals. It has recently been recognized that miRNAs comprise one of the abundant gene families in multicellular species and that their regulatory functions in various biological processes are widespread. Research activity in this field has surged in the past few years. From the very beginning, computational methods have been indispensable tools, and many discoveries have been made by combining experimental and computational approaches. In this review, both the biological and the computational aspects of miRNA are discussed. A brief history of the discovery of miRNA and a discussion of microarray applications in miRNA research are also included.

Page 62-72

Research Article

Feature Selection for the Prediction of Translation Initiation Sites

Guo-Liang Li, Tze-Yun Leong

Translation initiation sites (TISs) are important signals in cDNA sequences. In many previous attempts to predict TISs in cDNA sequences, three major factors affected the prediction performance: the nature of the cDNA sequence sets, the relevant features selected, and the classification methods used. In this paper, we examine different approaches to selecting and integrating relevant features for TIS prediction. The top-ranked significant features include the features from the position weight matrix and the propensity matrix, the number of C nucleotides in the sequence downstream of the ATG, the number of downstream stop codons, the number of upstream ATGs, and the counts of certain amino acids, such as A and D. Using the numerical data generated from these features, different classification methods, including decision tree, naïve Bayes, and support vector machine, were applied to three independent sequence sets. The identified significant features were found to be biologically meaningful, and the experiments showed promising results.

Page 73-83
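
As an illustration of the count-based features named in the abstract above (C nucleotides downstream of the ATG, downstream stop codons, upstream ATGs), the following Python sketch computes them for one candidate ATG. It is not the authors' implementation: the function name `tis_features`, the window size, and the choice to count stop codons only in the reading frame of the candidate ATG are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def tis_features(seq, atg_pos, window=99):
    """Count-based features for the candidate ATG that starts at atg_pos.

    `window` (in nucleotides) is a hypothetical parameter: it limits how far
    upstream and downstream of the candidate ATG the counts are taken.
    """
    seq = seq.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos does not point at an ATG")

    upstream = seq[max(0, atg_pos - window):atg_pos]
    downstream = seq[atg_pos + 3:atg_pos + 3 + window]

    # Slide one nucleotide at a time so every upstream ATG is counted,
    # regardless of reading frame.
    upstream_atgs = sum(
        1 for i in range(len(upstream) - 2) if upstream[i:i + 3] == "ATG"
    )

    # Assumption: stop codons are counted only in the frame of the candidate ATG.
    downstream_stops = sum(
        1
        for i in range(0, len(downstream) - 2, 3)
        if downstream[i:i + 3] in STOP_CODONS
    )

    return {
        "downstream_C": downstream.count("C"),
        "downstream_stops": downstream_stops,
        "upstream_ATGs": upstream_atgs,
    }
```

A feature dictionary of this kind could then be turned into a numeric vector and passed to any of the classifiers mentioned in the abstract.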

Research Article

A Hybrid SOM-SVM Approach for the Zebrafish Gene Expression Analysis

Wei Wu, Xin Liu, Min Xu, Jin-Rong Peng, Rudy Setiono

Microarray technology can be employed to quantitatively measure the expression of thousands of genes in a single experiment. In recent years, it has become one of the main tools for global gene expression analysis in molecular biology research. The large amount of expression data generated by this technology makes the study of certain complex biological problems possible, and machine learning methods are expected to play a crucial role in the analysis process. In this paper, we present our results from integrating the self-organizing map (SOM) and the support vector machine (SVM) to analyze the various functions of zebrafish genes based on their expression. The most distinctive characteristic of our zebrafish gene expression data is that the number of samples in the different classes is imbalanced. We discuss how the SOM can be used as a data-filtering tool to improve the classification performance of the SVM on this data set.

Page 84-93

Research Article

The Evolutionary Relationship of the Domain Architectures in the RhoGEF-containing Proteins

Qing-Lan Sun, Hong-Jun Zhou, Kui Lin

Domain insertions and deletions lead to variations in the domain architectures of proteins descended from a common ancestor. In this work, we investigated four groups of RhoGEF-containing proteins from different organisms, with the domain architectures RhoGEF-PH-SH3, SH3-RhoGEF-PH, RhoGEF-PH, and SH3-RhoGEF as defined in the Pfam database. Phylogenetic trees were constructed using each individual domain and/or combinations of all the domains. The phylogenetic analysis suggests that RhoGEF-PH-SH3 and SH3-RhoGEF-PH might have evolved independently from RhoGEF-PH through the insertion of SH3, while the SH3-RhoGEF architecture of proteins in fruit fly might have evolved from SH3-RhoGEF-PH through the degeneration of the PH domain.

Page 94-106

Research Article

FAMCS: Finding All Maximal Common Substructures in Proteins

Zhen Yao, Juan Xiao, Anthony K. H. Tung, Wing Kin Sung

Finding the common substructures shared by two proteins is considered one of the central issues in computational biology because of its usefulness in understanding the structure-function relationship and its application in drug and vaccine design. In this paper, we propose a novel algorithm called FAMCS (Finding All Maximal Common Substructures) for the common substructure identification problem. Our method works initially at the level of protein secondary structural elements (SSEs) and starts by identifying all structurally similar SSE pairs. These SSE pairs are then merged into sets using a modified Apriori algorithm, which tests the similarity of various sets of SSE pairs incrementally until all the maximal sets of SSE pairs that are deemed to be similar are found. The maximal common substructures of the two proteins are formed from these maximal sets. A refinement algorithm is also proposed to fine-tune the alignment from the SSE level to the residue level. Comparison of FAMCS with other methods on various proteins shows that FAMCS can address all four requirements and infer interesting biological discoveries.

Page 107-119
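
The incremental, Apriori-style merging described in the abstract above can be illustrated with a generic level-wise search. The Python sketch below is not FAMCS itself: the function name `maximal_compatible_sets` is hypothetical, the similarity test `is_similar` is a placeholder supplied by the caller, and the search assumes that every subset of an accepted set is also accepted.

```python
from itertools import combinations

def maximal_compatible_sets(items, is_similar):
    """Level-wise (Apriori-style) search for all maximal accepted sets.

    `is_similar` takes a frozenset of items and returns True or False.
    Assumption: it is downward closed, i.e. every subset of an accepted
    set is accepted as well.
    """
    level = [frozenset([x]) for x in items if is_similar(frozenset([x]))]
    maximal = []
    while level:
        current = set(level)
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            # Join two k-sets only when they differ in exactly one item.
            if len(union) != len(a) + 1:
                continue
            # Apriori pruning: every k-subset of the candidate must be accepted.
            if all(union - {x} in current for x in union):
                candidates.add(union)
        next_level = [c for c in candidates if is_similar(c)]
        # A set with no accepted superset one level up is maximal.
        for s in level:
            if not any(s < t for t in next_level):
                maximal.append(s)
        level = next_level
    return maximal
```

In a setting like the one in the abstract, each item would stand for one structurally similar SSE pair, and `is_similar` would decide whether a whole set of such pairs is mutually consistent; how that decision is made is left open here.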

Research Article

Sorting by Restricted-Length-Weighted Reversals

Thach Cam Nguyen, Hieu Trung Ngo, Nguyen Bao Nguyen

Classical sorting by reversals uses the unit-cost model, in which every reversal incurs the same cost. This model limits the biological meaning of sorting by reversals. Bender and his colleagues extended it by assigning a cost function f(l) = l^α for all α ≥ 0, where l is the length of the reversed subsequence. In this paper, we extend their results by considering a model in which long reversals are prohibited. Using the same cost function for the permitted reversals, we present tight or nearly tight bounds for the worst-case cost of sorting by reversals. We then develop algorithms that approximate the optimal cost of sorting a given 0/1 sequence as well as a given permutation. Our proposed problems are more biologically meaningful, and more algorithmically general and challenging, than the problem considered by Bender et al. Furthermore, our bounds are tight or nearly tight, and our algorithms provide good approximation ratios with respect to the optimal cost of sorting 0/1 sequences or permutations by reversals.

Page 120-127
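
To make the cost model in the abstract above concrete, the following Python sketch applies a list of reversals to a sequence and sums the cost f(l) = l^α of each one, rejecting any reversal longer than a permitted maximum. It only evaluates a given sequence of reversals and does not implement the authors' bounds or approximation algorithms; the function name `apply_reversals` and the parameter `max_len` are assumptions made for the example.

```python
def apply_reversals(seq, reversals, alpha, max_len):
    """Apply reversals in order and return (resulting sequence, total cost).

    Each reversal is a pair (i, j) that reverses the slice seq[i:j], so its
    length is l = j - i and its cost is l ** alpha. Reversals longer than
    `max_len` (a hypothetical name for the length restriction) are rejected.
    """
    seq = list(seq)
    total = 0.0
    for i, j in reversals:
        length = j - i
        if i < 0 or j > len(seq) or length <= 0:
            raise ValueError(f"invalid reversal ({i}, {j})")
        if length > max_len:
            raise ValueError(f"reversal of length {length} exceeds max_len={max_len}")
        seq[i:j] = seq[i:j][::-1]
        total += length ** alpha
    return seq, total
```

With α = 0 every reversal costs 1, which recovers the classical unit-cost model; with α = 1 the cost equals the total length of the reversed segments.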