A Content-Centric Organization of the Genetic Code

Jun Yu

The codon table for the canonical genetic code can be rearranged in such a way that the code is divided into four quarters and two halves according to the variability of their GC and purine contents, respectively. For prokaryotic genomes, when the genomic GC content increases, their amino acid contents tend to be restricted to the GC-rich quarter and the purine-content insensitive half, where all codons are fourfold degenerate and relatively mutation-tolerant. Conversely, when the genomic GC content decreases, most of the codons retract to the AU-rich quarter and the purine-content sensitive half; most of these codons not only continue to encode physicochemically diverse amino acids but also vary when transversion (between purine and pyrimidine) happens. Amino acids with sixfold-degenerate codons are distributed into all four quarters and across the two halves; their fourfold-degenerate codons are all partitioned into the purine-insensitive half in favor of robustness against mutations. The features manifested in the rearranged codon table explain most of the intrinsic relationship between protein coding sequences (the informational content) and amino acid compositions (the functional content). The renovated codon table is useful in predicting abundant amino acids and positioning the amino acids with related or distinct physicochemical properties.
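The two quantities on which this rearrangement rests, GC content and purine content, can be computed for any sequence or codon. The following is a minimal sketch only; the function names are hypothetical and it does not reproduce the author's assignment of codons to quarters and halves.

```python
def _fraction(seq: str, bases: str) -> float:
    """Fraction of positions in seq whose base is one of `bases`."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in bases for base in seq) / len(seq)


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    return _fraction(seq, "GC")


def purine_content(seq: str) -> float:
    """Fraction of bases that are purines (A or G)."""
    return _fraction(seq, "AG")


if __name__ == "__main__":
    # Illustrative codons only (RNA alphabet); not taken from the paper.
    for codon in ("GCC", "AUU", "GGA"):
        print(codon, gc_content(codon), purine_content(codon))
```

Applied codon by codon, these two measures give the coordinates along which a table of this kind can be laid out.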

A Brief Review of Short Tandem Repeat Mutation

Hao Fan, Jia-You Chu

Short tandem repeats (STRs) are short tandemly repeated DNA sequences that involve a repetitive unit of 1–6 bp. Because of their polymorphisms and high mutation rates, STRs are widely used in biological research. Strand-slippage replication is the predominant mutation mechanism of STRs, and the stepwise mutation model is regarded as the main mutation model. STR mutation rates can be influenced by many factors. Moreover, some trinucleotide repeats are associated with human neurodegenerative diseases. In order to deepen our knowledge of these diseases and broaden STR application, it is essential to understand the STR mutation process in detail. In this review, we focus on the currently known information about STR mutation.
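In its simplest form, the stepwise mutation model assumes that each mutation event adds or removes exactly one repeat unit. A minimal simulation sketch under that assumption follows; the function name, the symmetric gain/loss probability, the lower bound of one repeat, and the example mutation rate are all illustrative choices, not values from the review.

```python
import random


def simulate_stepwise(start_repeats: int, generations: int,
                      mutation_rate: float, rng: random.Random) -> int:
    """Follow one STR allele through `generations` under a simple
    symmetric single-step model and return its final repeat count."""
    repeats = start_repeats
    for _ in range(generations):
        # A mutation occurs with probability `mutation_rate` per generation.
        if rng.random() < mutation_rate:
            # Gain or lose one repeat unit with equal probability (assumption).
            repeats += rng.choice((-1, 1))
            # Keep at least one repeat unit (assumed lower bound).
            repeats = max(repeats, 1)
    return repeats


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    print(simulate_stepwise(10, 1000, 0.001, rng))
```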

Analysis of Pathway Activity in Primary Tumors and NCI60 Cell Lines Using Gene Expression Profiling Data

Xing-Dong Feng, Shu-Guang Huang, Jian-Yong Shou, Bi-Rong Liao, Jonathan M. Yingling, Xiang Ye, Xi Lin, Lawrence M. Gelbert, Eric W. Su, Jude E. Onyia, Shu-Yu Li

To determine cancer pathway activities in nine types of primary tumors and NCI60 cell lines, we applied an in silico approach by examining gene signatures reflective of consequent pathway activation using gene expression data. Supervised learning approaches predicted that the Ras pathway is active in ∼70% of lung adenocarcinomas but inactive in most squamous cell carcinomas, pulmonary carcinoids, and small cell lung carcinomas. In contrast, the TGF-β, TNF-α, Src, Myc, E2F3, and β-catenin pathways are inactive in lung adenocarcinomas. We predicted an active Ras, Myc, Src, and/or E2F3 pathway in significant percentages of breast cancer, colorectal carcinoma, and gliomas. Our results also suggest that Ras may be the most prevalent oncogenic pathway. Additionally, many NCI60 cell lines exhibited a gene signature indicative of an active Ras, Myc, and/or Src, but not E2F3, β-catenin, TNF-α, or TGF-β pathway. To our knowledge, this is the first comprehensive survey of cancer pathway activities in nine major tumor types and the most widely used NCI60 cell lines. The “gene expression pathway signatures” we have defined could facilitate the understanding of molecular mechanisms in cancer development and provide guidance to the selection of appropriate cell lines for cancer research and pharmaceutical compound screening.

Alternative Splicing and Expression Profile Analysis of Expressed Sequence Tags in Domestic Pig

Liang Zhang, Lin Tao, Lin Ye, Ling He, Yuan-Zhong Zhu, Yue-Dong Zhu, Yan Zhou

Domestic pig (Sus scrofa domestica) is one of the most important mammals to humans. Alternative splicing is a cellular mechanism in eukaryotes that greatly increases the diversity of gene products. Expressed sequence tags (ESTs) have been widely used for gene discovery, expression profile analysis, and alternative splicing detection. In this study, a total of 712,905 ESTs extracted from 101 different non-normalized EST libraries of the domestic pig were analyzed. These EST libraries cover the nervous system, digestive system, immune system, and meat production related tissues from embryo, newborn, and adult pigs, making contributions to the analysis of alternative splicing variants as well as expression profiles in various stages of tissues. A modified approach was designed to cluster and assemble large EST datasets, aiming to detect alternative splicing together with EST abundance of each splicing variant. Much effort was made to classify alternative splicing into different types and apply different filters to each type to get more reliable results. Finally, a total of 1,223 genes with an average of 2.8 splicing variants were detected among 16,540 unique genes. The overview of expression profiles would change when we take alternative splicing into account.

Comparative Analysis of the 100 kb Region Containing the Pi-kh Locus Between indica and japonica Rice Lines

S.P. Kumar, V. Dalai, N.K. Singh, T.R. Sharma

We have recently cloned a pathogen-inducible blast resistance gene Pi-kh from the indica rice line Tetep using a positional cloning approach. In this study, we carried out structural organization analysis of the Pi-kh locus in both indica and japonica rice lines. A 100 kb region containing 50 kb upstream and 50 kb downstream sequences flanking the Pi-kh locus was selected for the investigation. A total of 16 genes in indica and 15 genes in japonica were predicted and annotated in this region. The average GC content of indica and japonica genes in this region was 53.15% and 49.3%, respectively. Both indica and japonica sequences were polymorphic for simple sequence repeats having mono-, di-, tri-, tetra-, and pentanucleotides. Sequence analysis of the specific blast-resistant Pi-kh allele of Tetep and the susceptible Pi-kh allele of the japonica rice line Nipponbare showed differences in the number and distribution of motifs involved in phosphorylation, resulting in the resistance phenotype in Tetep.

Generation of Synthetic Transcriptome Data with Defined Statistical Properties for the Development and Testing of New Analysis Methods

Guillaume Brysbaert, Sebastian Noth, Arndt Benecke

We have previously developed a combined signal/variance distribution model that accounts for the particular statistical properties of datasets generated on the Applied Biosystems AB1700 transcriptome system. Here we show that this model can be efficiently used to generate synthetic datasets with statistical properties virtually identical to those of the actual data with the aid of the JAVA application creator 1.0 that we have developed. The fundamentally different structure of AB1700 transcriptome profiles requires re-evaluation, adaptation, or even redevelopment of many of the standard microarray analysis methods in order to avoid misinterpretation of the data on the one hand, and to draw full benefit from their increased specificity and sensitivity on the other hand. Our composite data model and the creator 1.0 application thereby not only present proof of the correctness of our parameter estimation, but also provide a tool for the generation of synthetic test data that will be useful for further development and testing of analysis methods.

Restauro-G: A Rapid Genome Re-Annotation System for Comparative Genomics

Satoshi Tamaki, Kazuharu Arakawa, Nobuaki Kono, Masaru Tomita

Annotations of complete genome sequences submitted directly from sequencing projects are diverse in terms of annotation strategies and update frequencies. These inconsistencies make comparative studies difficult. To allow rapid data preparation of a large number of complete genomes, automation and speed are important for genome re-annotation. Here we introduce an open-source rapid genome re-annotation software system, Restauro-G, specialized for bacterial genomes. Restauro-G re-annotates a genome by similarity searches utilizing the BLAST-Like Alignment Tool, referring to protein databases such as UniProt KB, NCBI nr, NCBI COGs, Pfam, and PSORTb. Re-annotation by Restauro-G achieved over 98% accuracy for most bacterial chromosomes in comparison with the original manually curated annotation of EMBL releases. Restauro-G was developed in the generic bioinformatics workbench G-language Genome Analysis Environment and is distributed under the GNU General Public License.

Genetic Polymorphisms of Nine X-STR Loci in Four Population Groups from Inner Mongolia, China

Qiao-Fang Hou, Bin Yu, Sheng-Bin Li

Nine short tandem repeat (STR) markers on the X chromosome (DXS101, DXS6789, DXS6799, DXS6804, DXS7132, DXS7133, DXS7423, DXS8378, and HPRTB) were analyzed in four population groups (Mongol, Ewenki, Oroqen, and Daur) from Inner Mongolia, China, in order to learn about the genetic diversity, forensic suitability, and possible genetic affinities of the populations. Frequency estimates, Hardy-Weinberg equilibrium, and other parameters of forensic interest were computed. The results revealed that the nine markers have a moderate degree of variability in the population groups. Most heterozygosity values for the nine loci range from 0.480 to 0.891, and there are evident differences in genetic variability among the populations. A UPGMA tree constructed on the basis of the generated data shows very low genetic distance between Mongol and Han (Xi'an) populations. Our results based on genetic distance analysis are consistent with the results of earlier studies based on linguistics and the immigration history and origin of these populations. The minisatellite loci on the X chromosome studied here are not only useful in showing significant genetic variation between the populations, but also are suitable for human identity testing among Inner Mongolian populations.

Genetic Analysis of 15 STR Loci in Chinese Han Population from West China

Ya-Jun Deng, Jiang-Wei Yan, Xiao-Guang Yu, Yuan-Zhe Li, Hao-Fang Mu, Yan-Qing Huang, Xiao-Tie Shi, Wei-Min Sun

Allele frequencies for 15 short tandem repeat (STR) loci (D8S1179, D21S11, D7S820, CSF1PO, D3S1358, TH01, D13S317, D16S539, D2S1338, D19S433, vWA, TPOX, D18S51, D5S818, and FGA) were obtained from 7,636 unrelated individuals of the Chinese Han population living in Qinghai and Chongqing, China. A total of 206 alleles were observed, with the corresponding allele frequencies ranging from 0.0001 to 0.4982. Chi-square test showed that all of the STR loci agreed with the Hardy-Weinberg equilibrium. We also compared our data with previously published population data of other ethnic groups or areas. The results are valuable for human identification and paternity testing in the Chinese Han population.
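A chi-square comparison against Hardy-Weinberg expectations for a single autosomal locus can be sketched as follows. This is an illustration of the general goodness-of-fit calculation, not the authors' procedure; the function name and the toy genotype counts are hypothetical, and the sketch returns only the statistic (no degrees of freedom, p-value, or pooling of rare genotypes).

```python
from collections import Counter
from itertools import combinations_with_replacement


def hwe_chi_square(genotype_counts: dict) -> float:
    """Chi-square statistic comparing observed genotype counts at one
    autosomal locus with Hardy-Weinberg expectations.

    genotype_counts maps (allele_a, allele_b) tuples to observed counts.
    """
    # Treat (a, b) and (b, a) as the same unordered genotype.
    observed = Counter()
    for (a, b), n in genotype_counts.items():
        observed[tuple(sorted((a, b)))] += n

    total = sum(observed.values())
    if total == 0:
        raise ValueError("no genotypes supplied")

    # Allele frequencies estimated from the genotype counts.
    allele_counts = Counter()
    for (a, b), n in observed.items():
        allele_counts[a] += n
        allele_counts[b] += n
    freqs = {allele: c / (2 * total) for allele, c in allele_counts.items()}

    chi2 = 0.0
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Expected proportion: p^2 for homozygotes, 2pq for heterozygotes.
        proportion = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        expected = total * proportion
        chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    # Toy counts for a two-allele locus (illustrative only).
    print(hwe_chi_square({("9", "9"): 25, ("9", "10"): 50, ("10", "10"): 25}))
```

For multi-allelic STR loci with many rare genotypes, expected counts can become very small, so in practice an exact test or pooling of rare classes is often preferred over this plain statistic.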

A Survey of the Availability of Primary Bioinformatics Web Resources

Trias Thireou, George Spyrou, Vassilis Atlamazoglou

The explosive growth of the bioinformatics field has led to a large amount of data and software applications publicly available as web resources. However, the lack of persistence of web references is a barrier to comprehensive shared access. We conducted a study of the current availability and other features of primary bioinformatics web resources (such as software tools and databases). The majority (95%) of the examined bioinformatics web resources were found running on UNIX/Linux operating systems, and the most widely used web server was found to be Apache (or Apache-related products). Of the overall 1,130 Uniform Resource Locators (URLs) examined, 91% were highly available (more than 90% of the time), while only 4% showed low accessibility (less than 50% of the time) during the survey. Furthermore, the most common URL failure modes are presented and analyzed.
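The availability thresholds quoted above (more than 90% of the time, less than 50% of the time) can be applied to a record of repeated probes per URL. The sketch below assumes each probe is stored as a boolean success flag; the function names and the "intermediate" label for the remaining range are hypothetical, and no network access is performed.

```python
def availability(probes: list) -> float:
    """Fraction of probes in which the URL responded successfully."""
    if not probes:
        raise ValueError("at least one probe is required")
    return sum(bool(p) for p in probes) / len(probes)


def classify_availability(probes: list) -> str:
    """Label a URL using the thresholds quoted in the abstract."""
    fraction = availability(probes)
    if fraction > 0.90:
        return "high"          # available more than 90% of the time
    if fraction < 0.50:
        return "low"           # available less than 50% of the time
    return "intermediate"      # label assumed for the remaining range


if __name__ == "__main__":
    # Toy probe history: True means the URL answered on that check.
    history = [True] * 19 + [False]
    print(availability(history), classify_availability(history))
```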

