Articles Online (Volume 19, Issue 6)


ASER: Animal Sex Reversal Database

Yangyang Li, Zonggui Chen, Hairong Liu, Qiming Li, Xing Lin, Shuhui Ji, Rui Li, Shaopeng Li, Weiliang Fan, Haiping Zhao, Zuoyan Zhu, Wei Hu, Yu Zhou, Daji Luo

Sex reversal, representing extraordinary sexual plasticity during the life cycle, not only triggers reproduction in animals but also affects reproductive and endocrine system-related diseases and cancers in humans. Sex reversal has been broadly reported in animals; however, an integrated resource hub of sex reversal information is still lacking. Here, we constructed a comprehensive database named ASER (Animal Sex Reversal) by integrating sex reversal-related data of 18 species from teleostei to mammalia. We systematically collected 40,018 published papers and mined the sex reversal-associated genes (SRGs), including their regulatory networks, from 1611 core papers. We annotated homologous genes and computed conservation scores for whole genomes across the 18 species. Furthermore, we collected available RNA-seq datasets and investigated the expression dynamics of SRGs during sex reversal or sex determination processes. In addition, we manually annotated 550 in situ hybridization (ISH), fluorescence in situ hybridization (FISH), and immunohistochemistry (IHC) images of SRGs from the literature and described their spatial expression in the gonads. Collectively, ASER provides a unique and integrated resource for researchers to query and reuse organized data to explore the mechanisms and applications of SRGs in animal breeding and human health. The ASER database is publicly available at
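Research question: Construction and application of the Animal Sex Reversal (ASER) database. Methods: We built the animal sex reversal database by integrating multiple bioinformatics and image analysis tools, mined candidate sex reversal-associated genes (SRGs) together with their regulatory elements and regulatory networks, and provided experimental data such as the spatiotemporal expression of SRGs. Main result 1: We constructed an omics information repository for 18 representative sex-reversing animal species, spanning aquatic to terrestrial animals (fish, amphibians, reptiles, birds, and mammals), for which sex reversal reports and related omics data exist. Main result 2: We identified the SRGs, their regulatory elements, and their regulatory networks in the 18 species. Main result 3: We provide RNA-seq data covering different sexes, gonadal developmental stages, and sex reversal processes in the 18 species, together with experimental data on the temporal and spatial expression of SRGs during sex reversal. Database link: .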

Page 873-881

Original Research

Specificity of mRNA Folding and Its Association with Evolutionarily Adaptive mRNA Secondary Structures

Gongwang Yu, Hanbing Zhu, Xiaoshu Chen, Jian-Rong Yang

The secondary structure is a fundamental feature of both non-coding RNAs (ncRNAs) and messenger RNAs (mRNAs). However, our understanding of the secondary structures of mRNAs, especially those of the coding regions, remains elusive, likely due to translation and the lack of RNA-binding proteins that sustain the consensus structure like those binding to ncRNAs. Indeed, mRNAs have recently been found to adopt diverse alternative structures, but the overall functional significance remains untested. We hereby approach this problem by estimating the folding specificity, i.e., the probability that a fragment of an mRNA folds back to the same partner once refolded. We show that the folding specificity of mRNAs is lower than that of ncRNAs and exhibits moderate evolutionary conservation. Notably, we find that specific rather than alternative folding is likely evolutionarily adaptive since specific folding is frequently associated with functionally important genes or sites within a gene. Additional analysis in combination with ribosome density suggests the ability to modulate ribosome movement as one potential functional advantage provided by specific folding. Our findings reveal a novel facet of the RNA structurome with important functional and evolutionary implications and indicate a potential method for distinguishing the mRNA secondary structures maintained by natural selection from molecular noise.
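As the "messenger" of genetic information, mRNA not only carries the codons that encode amino acid sequences; its physical nature as a long single-stranded nucleic acid also allows it to fold into complex secondary structures. These secondary structures can exert important regulatory effects on core molecular processes such as transcription and translation. However, compared with non-coding RNA (ncRNA) molecules, the secondary structure of mRNA may change dynamically because of interference from ribosomes during translation, and we still know very little about these dynamics. Identifying the structures with biological function among the various conformations that arise is harder still. To address these problems, we propose the following hypothesis: if the structures of mRNA molecules from the same gene are too diverse, few mRNA molecules will carry each structure, which hinders any particular structure from carrying out its biological function; conversely, highly specific folding lets more mRNA molecules share the same structure, making that structure more likely to be functional. Based on this hypothesis, we introduce the concept of "RNA folding specificity", i.e., the probability that an RNA fragment, after its structure changes and it refolds, pairs again with its specific partner RNA fragment; the higher this probability, the higher the folding specificity. In this study, we used an information theory-based algorithm to analyze publicly available high-throughput sequencing data that identify intramolecular folding and pairing relationships in mRNAs, and we systematically studied the specificity of mRNA folding and pairing at the genome scale. We found that: (i) compared with lowly expressed mRNAs with lower sequence conservation, highly expressed and highly conserved mRNAs have secondary structures with higher folding specificity; (ii) within the same mRNA, bases with higher folding specificity tend to be more conserved in sequence; (iii) mRNA fragments with higher folding specificity affect the speed of ribosome translation; and (iv) compared with non-functional regions, mRNA structures with known functions have higher folding specificity. These findings suggest that, despite ribosome interference in the cell, the intramolecular interactions in mRNA secondary structures can still be optimized by natural selection during evolution, forming functionally active, highly specific secondary structures. In addition, the evolutionary advantage of more specifically paired secondary structures may derive from their ability to regulate the translation elongation rate. Our study reveals for the first time the important influence of the folding specificity of mRNA secondary structures on mRNA sequence evolution, and it offers new ideas for better understanding the functions of mRNA secondary structures, the mechanisms by which they affect gene sequence evolution, and the identification of functional secondary structures among the vast number of observed structures.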
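To make the notion of folding specificity in the abstract above more concrete, the following minimal sketch scores a fragment by how concentrated its observed pairing partners are. The entropy-based formula, the function name `folding_specificity`, and the toy counts are assumptions made for illustration; the abstract does not give the authors' formula, so this is not their method.

```python
import math

def folding_specificity(partner_counts):
    """Return 1 - normalized Shannon entropy of the partner distribution.

    partner_counts maps each observed pairing partner (hypothetical labels)
    to the number of reads supporting that pairing. A value of 1.0 means
    the fragment always pairs with one partner; 0.0 means all observed
    partners are equally likely.
    """
    counts = [c for c in partner_counts.values() if c > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("no pairing observations")
    if len(counts) == 1:
        return 1.0
    probs = [c / total for c in counts]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Divide by the maximum possible entropy for this number of partners.
    return 1.0 - entropy / math.log2(len(probs))

# Toy example: a fragment that mostly pairs with one partner scores higher
# than one whose pairings are spread evenly.
specific = folding_specificity({"partner_A": 9, "partner_B": 1})
diffuse = folding_specificity({"partner_A": 5, "partner_B": 5})
```

Normalizing by the maximum entropy keeps the score between 0 and 1 regardless of how many partners were observed, which is one simple design choice among several possible ones.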

Page 882-900

Original Research

Differential Splicing of Skipped Exons Predicts Drug Response in Cancer Cell Lines

Edward Simpson, Steven Chen, Jill L. Reiter, Yunlong Liu

Alternative splicing of pre-mRNA transcripts is an important regulatory mechanism that increases the diversity of gene products in eukaryotes. Various studies have linked specific transcript isoforms to altered drug response in cancer; however, few algorithms have incorporated splicing information into drug response prediction. In this study, we evaluated whether basal-level splicing information could be used to predict drug sensitivity by constructing doxorubicin-sensitivity classification models with splicing and expression data. We detailed splicing differences between sensitive and resistant cell lines by implementing quasi-binomial generalized linear modeling (QBGLM) and found altered inclusion of 277 skipped exons. We additionally conducted RNA-binding protein (RBP) binding motif enrichment and differential expression analysis to characterize cis- and trans-acting elements that potentially influence doxorubicin response-mediating splicing alterations. Our results showed that a classification model built with skipped exon data exhibited strong predictive power. We discovered an association between differentially spliced events and epithelial-mesenchymal transition (EMT) and observed motif enrichment, as well as differential expression of RBFOX and ELAVL RBP family members. Our work demonstrates the potential of incorporating splicing data into drug response algorithms and the utility of a QBGLM approach for fast, scalable identification of relevant splicing differences between large groups of samples.

Page 901-912

Original Research

The mRNA–miRNA–lncRNA Regulatory Network and Factors Associated with Prognosis Prediction of Hepatocellular Carcinoma

Bo Hu, Xiaolu Ma, Peiyao Fu, Qiman Sun, Weiguo Tang, Haixiang Sun, Zhangfu Yang, Mincheng Yu, Jian Zhou, Jia Fan, Yang Xu

The aim of this study was to identify novel prognostic mRNA and microRNA (miRNA) biomarkers for hepatocellular carcinoma (HCC) using methods in systems biology. Differentially expressed mRNAs, miRNAs, and long non-coding RNAs (lncRNAs) were compared between HCC tumor tissues and normal liver tissues in The Cancer Genome Atlas (TCGA) database. Subsequently, a prognosis-associated mRNA co-expression network, an mRNA–miRNA regulatory network, and an mRNA–miRNA–lncRNA regulatory network were constructed to identify prognostic biomarkers for HCC through Cox survival analysis. Seven prognosis-associated mRNA co-expression modules were obtained by analyzing these differentially expressed mRNAs. An expression module including 120 mRNAs was significantly correlated with HCC patient survival. Combined with patient survival data, several mRNAs and miRNAs, including CHST4, SLC22A8, STC2, hsa-miR-326, and hsa-miR-21, were identified from the network to predict HCC patient prognosis. Clinical significance was investigated using tissue microarray analysis of samples from 258 patients with HCC. Functional annotation of hsa-miR-326 and hsa-miR-21-5p indicated specific associations with several cancer-related pathways. The present study provides a bioinformatics method for biomarker screening, leading to the identification of an integrated mRNA–miRNA–lncRNA regulatory network and its co-expression patterns in relation to predicting HCC patient survival.

Page 913-925

Original Research

Trilineage Sequencing Reveals Complex TCRβ Transcriptomes in Neutrophils and Monocytes Alongside T Cells

Tina Fuchs, Kerstin Puellmann, Chunlin Wang, Jian Han, Alexander W. Beham, Michael Neumaier, Wolfgang E. Kaminski

Recent findings indicate the presence of T cell receptor (TCR)-based combinatorial immune receptors beyond T cells in neutrophils and monocytes/macrophages. In this study, using a semiquantitative trilineage immune repertoire sequencing approach as well as under rigorous bioinformatic conditions, we identify highly complex TCRβ transcriptomes in human circulating monocytes and neutrophils that separately encode repertoire diversities one and two orders of magnitude smaller than that of T cells. Intraindividual transcriptomic analyses reveal that neutrophils, monocytes, and T cells express distinct TCRβ repertoires with less than 0.1% overall trilineage repertoire sharing. Interindividual comparison shows that in all three leukocyte lineages, the vast majority of the expressed TCRβ variants are private. We also find that differentiation of monocytes into macrophages induces dramatic individual-specific repertoire shifts, revealing a surprising degree of immune repertoire plasticity in the monocyte lineage. These results uncover the remarkable complexity of the two phagocyte-based flexible immune systems which until now has been hidden in the shadow of T cells.

Page 926-936

Original Research

Data Comparison and Software Design for Easy Selection and Application of CRISPR-based Genome Editing Systems in Plants

Yi Wang, Fatma Lecourieux, Rui Zhang, Zhanwu Dai, David Lecourieux, Shaohua Li, Zhenchang Liang

CRISPR-based genome editing systems have been successfully and effectively used in many organisms. However, only a few studies have compared the CRISPR/Cas9 and CRISPR/Cpf1 systems in whole-genome applications. Although many web-based toolkits are available, there is still a shortage of comprehensive, user-friendly, and plant-specific CRISPR databases and desktop software. In this study, we identified and analyzed the similarities and differences between CRISPR/Cas9 and CRISPR/Cpf1 systems by considering the abundance of proto-spacer adjacent motif (PAM) sites, the effects of GC content, optimal proto-spacer length, potential universality within the plant kingdom, PAM-rich region (PARR) inhibiting ratio, and the effects of G-quadruplex (G-Q) structures. Using this information, we built a comprehensive CRISPR database (including 138 plant genome data sources), which provides search tools for the identification of CRISPR editing sites in both CRISPR/Cas9 and CRISPR/Cpf1 systems. We also developed desktop software based on the Perl/Tk tool, which facilitates and improves the detection and analysis of CRISPR editing sites at the whole-genome level on Linux and/or Windows platforms. Therefore, this study provides helpful data and software for easy selection and application of CRISPR-based genome editing systems in plants.
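Two of the quantities compared in the abstract above, PAM site abundance and GC content, can be illustrated with a short sketch. It assumes the commonly cited PAMs (NGG for SpCas9, TTTV for Cpf1/Cas12a); the study's own PAM definitions, its software, and its scoring may differ, and the function names here are hypothetical.

```python
import re

# Assumed PAM definitions: SpCas9 recognizes NGG and Cpf1 (Cas12a) recognizes
# TTTV, where V is A, C, or G. Lookaheads let overlapping sites be counted.
CAS9_PAM = re.compile(r"(?=[ACGT]GG)")
CPF1_PAM = re.compile(r"(?=TTT[ACG])")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def count_pam_sites(seq):
    """Count candidate PAM sites on both strands of a DNA sequence."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    return {
        "Cas9_NGG": len(CAS9_PAM.findall(seq)) + len(CAS9_PAM.findall(rc)),
        "Cpf1_TTTV": len(CPF1_PAM.findall(seq)) + len(CPF1_PAM.findall(rc)),
    }

def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

# Toy sequence, not taken from any genome in the study.
sites = count_pam_sites("AGGTTTA")
```

Counting on both strands matters because a PAM on the reverse strand defines an editing site just as one on the forward strand does; the sketch ignores ambiguous bases such as N in the input.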

Page 937-948

Original Research

Selection for Cheaper Amino Acids Drives Nucleotide Usage at the Start of Translation in Eukaryotic Genes

Na L. Gao, Zilong He, Qianhui Zhu, Puzi Jiang, Songnian Hu, Wei-Hua Chen

Coding regions have complex interactions among multiple selective forces, which are manifested as biases in nucleotide composition. Previous studies have revealed a decreasing GC gradient from the 5′-end to 3′-end of coding regions in various organisms. We confirmed that this gradient is universal in eukaryotic genes, but the decrease only starts from the ∼ 25th codon. This trend is mostly found in nonsynonymous (ns) sites at which the GC gradient is universal across the eukaryotic genome. Increased GC contents at ns sites result in cheaper amino acids, indicating a universal selection for energy efficiency toward the N-termini of encoded proteins. Within a genome, the decreasing GC gradient is intensified from lowly to highly expressed genes (more and more protein products), further supporting this hypothesis. This reveals a conserved selective constraint for cheaper amino acids at the translation start that drives the increased GC contents at ns sites. Elevated GC contents can facilitate transcription but result in a more stable local secondary structure around the start codon and subsequently impede translation initiation. Conversely, the GC gradients at four-fold and two-fold synonymous sites vary across species. They could decrease or increase, suggesting different constraints acting at the GC contents of different codon sites in different species. This study reveals that the overall GC contents at the translation start are consequences of complex interactions among several major biological processes that shape the nucleotide sequences, especially efficient energy usage.
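The GC gradient along coding regions described in the abstract above can be sketched as a per-codon-index average of GC fraction across genes. This is a minimal illustration under stated assumptions: the function name and the toy sequences are hypothetical, and the sketch does not separate nonsynonymous from synonymous sites as the study does.

```python
def gc_by_codon_index(cds_list, max_codons=50):
    """Mean GC fraction at each codon index (0 = start codon) across genes.

    cds_list holds in-frame coding sequences. Indices not covered by any
    sequence yield None.
    """
    gc_counts = [0] * max_codons
    base_counts = [0] * max_codons
    for cds in cds_list:
        cds = cds.upper()
        for i in range(min(max_codons, len(cds) // 3)):
            codon = cds[3 * i : 3 * i + 3]
            gc_counts[i] += sum(base in "GC" for base in codon)
            base_counts[i] += 3
    return [g / n if n else None for g, n in zip(gc_counts, base_counts)]

# Toy input: two short made-up coding sequences.
profile = gc_by_codon_index(["ATGGCC", "ATGAAA"], max_codons=3)
```

Plotting such a profile for a whole genome would be one way to look for the decline that the abstract reports as starting around the 25th codon.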

Page 949-957

Original Research

Biogeographic and Evolutionary Patterns of Trace Element Utilization in Marine Microbial World

Yinzhen Xu, Jiayu Cao, Liang Jiang, Yan Zhang

Trace elements, which are key components of many enzymes catalyzing important biological reactions, are required by all organisms. Many trace element-dependent proteins have been characterized; however, little is known about their occurrence in microbial communities in diverse environments, especially the global marine ecosystem. Moreover, the relationships between trace element utilization and different types of environmental stressors are unclear. In this study, we used metagenomic data from the Global Ocean Sampling expedition project to identify the biogeographic distribution of genes encoding trace element-dependent proteins (for copper, molybdenum, cobalt, nickel, and selenium) in a variety of marine and non-marine aquatic samples. More than 56,000 metalloprotein and selenoprotein genes corresponding to nearly 100 families were predicted, forming the largest dataset of marine metalloprotein and selenoprotein genes reported to date. In addition, samples with enriched or depleted metalloprotein/selenoprotein genes were identified, suggesting active or inactive usage of these micronutrients at various sites. Further analysis of interactions among the elements showed significant correlations between some of them, especially those between nickel and selenium/copper. Finally, investigation of the relationships between environmental conditions and metalloprotein/selenoprotein families revealed that many environmental factors might contribute to the evolution of different metalloprotein and/or selenoprotein genes in the marine microbial world. Our data provide new insights into the utilization and biological roles of these trace elements in extant marine microbes, and might also help in understanding how these organisms have adapted to their local environments.

Page 958-972

Original Research

eTumorMetastasis: A Network-based Algorithm Predicts Clinical Outcomes Using Whole-exome Sequencing Data of Cancer Patients

Jean-Sébastien Milanese, Chabane Tibiche, Naif Zaman, Jinfeng Zou, Pengyong Han, Zhigang Meng, Andre Nantel, Arnaud Droit, Edwin Wang

Continual reduction in sequencing cost is expanding the accessibility of genome sequencing data for routine clinical applications. However, the lack of methods to construct machine learning-based predictive models using these datasets has become a crucial bottleneck for the application of sequencing technology in clinics. Here, we develop a new algorithm, eTumorMetastasis, which transforms tumor functional mutations into network-based profiles and identifies network operational gene (NOG) signatures. NOG signatures model the tipping point at which a tumor cell shifts from a state that doesn’t favor recurrence to one that does. We show that NOG signatures derived from genomic mutations of tumor founding clones (i.e., the ‘most recent common ancestor’ of the cells within a tumor) significantly distinguish the recurred and non-recurred breast tumors as well as outperform the most popular genomic test (i.e., Oncotype DX). These results imply that mutations of the tumor founding clones are associated with tumor recurrence and can be used to predict clinical outcomes. As such, predictive tools could be used in clinics to guide treatment routes. Finally, the concepts underlying the eTumorMetastasis pave the way for the application of genome sequencing in predictions for other complex genetic diseases. eTumorMetastasis pseudocode and related data used in this study are available at
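The continual decline in genome sequencing costs is expanding the availability of genome sequencing data for routine clinical applications. However, methods for building predictive models from these datasets remain immature. We propose a new algorithm, eTumorMetastasis, which transforms functional mutations from tumor genome sequencing into molecular network-based signals and thereby identifies molecular network biomarkers. These biomarkers describe the tipping point at which a tumor cell shifts from a non-recurrent state to a state that favors tumor recurrence. The results show that molecular network biomarkers derived from genomic mutations of tumor founding clones (i.e., the "most recent common ancestor" of the cells within a tumor) significantly distinguish recurred from non-recurred breast tumors and outperform the most popular genomic test biomarker (e.g., the Oncotype DX breast cancer recurrence score). We therefore believe that molecular network biomarkers can be used with predictive tools in the clinic to guide treatment routes and improve patient outcomes. Finally, the concepts behind the eTumorMetastasis algorithm pave the way for applying genome sequencing to build predictive models for other complex genetic diseases.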

Page 973-985


Identifying Novel Drug Targets by iDTPnd: A Case Study of Kinase Inhibitors

Hammad Naveed, Corinna Reglin, Thomas Schubert, Xin Gao, Stefan T. Arold, Michael L. Maitland

Current FDA-approved kinase inhibitors cause diverse adverse effects, some of which are due to the mechanism-independent effects of these drugs. Identifying these mechanism-independent interactions could improve drug safety and support drug repurposing. Here, we develop iDTPnd (integrated Drug Target Predictor with negative dataset), a computational approach for large-scale discovery of novel targets for known drugs. For a given drug, we construct a positive structural signature as well as a negative structural signature that captures the weakly conserved structural features of drug-binding sites. To facilitate assessment of unintended targets, iDTPnd also provides a docking-based interaction score and its statistical significance. We confirm the interactions of sorafenib, imatinib, dasatinib, sunitinib, and pazopanib with their known targets at a sensitivity of 52% and a specificity of 55%. We also validate 10 predicted novel targets by using in vitro experiments. Our results suggest that proteins other than kinases, such as nuclear receptors, cytochrome P450, and MHC class I molecules, can also be physiologically relevant targets of kinase inhibitors. Our method is general and broadly applicable for the identification of protein–small molecule interactions, when sufficient drug–target 3D data are available. The code for constructing the structural signatures is available at

Page 986-997


QAUST: Protein Function Prediction Using Structure Similarity, Protein Interaction, and Functional Motifs

Fatima Zohra Smaili, Shuye Tian, Ambrish Roy, Meshari Alazmi, Stefan T. Arold, Srayanta Mukherjee, P. Scott Hefty, Wei Chen, Xin Gao

The number of available protein sequences in public databases is increasing exponentially. However, a significant percentage of these sequences lack functional annotation, which is essential for the understanding of how biological systems operate. Here, we propose a novel method, Quantitative Annotation of Unknown STructure (QAUST), to infer protein functions, specifically Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. QAUST uses three sources of information: structure information encoded by global and local structure similarity search, biological network information inferred by protein–protein interaction data, and sequence information extracted from functionally discriminative sequence motifs. These three pieces of information are combined by consensus averaging to make the final prediction. Our approach has been tested on 500 protein targets from the Critical Assessment of Functional Annotation (CAFA) benchmark set. The results show that our method provides accurate functional annotation and outperforms other prediction methods based on sequence similarity search or threading. We further demonstrate that a previously unknown function of human tripartite motif-containing 22 (TRIM22) protein predicted by QAUST can be experimentally validated.
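The consensus averaging step mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming equal weights for the three sources and a score of 0 for any term a source does not report; the abstract does not specify either choice, and the function name and toy scores are hypothetical.

```python
def consensus_average(score_sets):
    """Average per-term confidence scores over several prediction sources.

    score_sets is a list of dicts mapping a functional term (e.g., a GO ID)
    to a score in [0, 1]. A term missing from a source counts as 0.0, which
    is an assumption of this sketch.
    """
    if not score_sets:
        return {}
    terms = set().union(*score_sets)
    return {
        term: sum(scores.get(term, 0.0) for scores in score_sets) / len(score_sets)
        for term in terms
    }

# Toy scores for one protein from three hypothetical sources.
structure_scores = {"GO:0003824": 0.9, "GO:0005515": 0.3}
interaction_scores = {"GO:0005515": 0.6}
motif_scores = {"GO:0003824": 0.6}
consensus = consensus_average([structure_scores, interaction_scores, motif_scores])
```

Treating a missing term as 0 penalizes predictions supported by only one source; averaging only over the sources that report a term would be an equally plausible alternative.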

Page 998-1011


Computational Assessment of Protein–protein Binding Affinity by Reversely Engineering the Energetics in Protein Complexes

Bo Wang, Zhaoqian Su, Yinghao Wu

The cellular functions of proteins are maintained by forming diverse complexes. The stability of these complexes is quantified by the measurement of binding affinity, and mutations that alter the binding affinity can cause various diseases such as cancer and diabetes. As a result, accurate estimation of the binding stability and the effects of mutations on changes of binding affinity is a crucial step to understanding the biological functions of proteins and their dysfunctional consequences. It has been hypothesized that the stability of a protein complex is dependent not only on the residues at its binding interface by pairwise interactions but also on all other remaining residues that do not appear at the binding interface. Here, we computationally reconstruct the binding affinity by decomposing it into the contributions of interfacial residues and other non-interfacial residues in a protein complex. We further assume that the contributions of both interfacial and non-interfacial residues to the binding affinity depend on their local structural environments such as solvent-accessible surfaces and secondary structural types. The weights of all corresponding parameters are optimized by Monte-Carlo simulations. After cross-validation against a large-scale dataset, we show that the model not only shows a strong correlation between the absolute values of the experimental and calculated binding affinities, but can also be an effective approach to predict the relative changes of binding affinity from mutations. Moreover, we have found that the optimized weights of many parameters can capture the first-principle chemical and physical features of molecular recognition, therefore reversely engineering the energetics of protein complexes. These results suggest that our method can serve as a useful addition to current computational approaches for predicting binding affinity and understanding the molecular mechanism of protein–protein interactions.

Page 1012-1022

Application Note

TSUNAMI: Translational Bioinformatics Tool Suite for Network Analysis and Mining

Zhi Huang, Zhi Han, Tongxin Wang, Wei Shao, Shunian Xiang, Paul Salama, Maher Rizkalla, Kun Huang, Jie Zhang

Gene co-expression network (GCN) mining identifies gene modules with highly correlated expression profiles across samples/conditions. It enables researchers to discover latent gene/molecule interactions, identify novel gene functions, and extract molecular features from certain disease/condition groups, thus helping to identify disease biomarkers. However, an easy-to-use tool package is lacking for users to mine GCN modules that are relatively small, contain tightly connected genes, and are therefore convenient for downstream gene set enrichment analysis, as well as modules that may share common members. To address this need, we developed an online GCN mining tool package: TSUNAMI (Tools SUite for Network Analysis and MIning). TSUNAMI incorporates our state-of-the-art lmQCM algorithm to mine GCN modules for both public and user-input data (microarray, RNA-seq, or any other numerical omics data), and then performs downstream gene set enrichment analysis for the identified modules. It has several features and advantages: 1) a user-friendly interface and real-time co-expression network mining through a web server; 2) direct access and search of NCBI Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases, as well as user-input gene expression matrices for GCN module mining; 3) multiple co-expression analysis tools to choose from, all of which are highly flexible with regard to parameter selection options; 4) summarization of identified GCN modules as eigengenes, which are convenient for users to check their correlation with other clinical traits; 5) integrated downstream Enrichr enrichment analysis and links to other gene set enrichment tools; and 6) visualization of gene loci by Circos plot at any step of the process. The web service is freely accessible through URL: Source code is available at
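Mining gene co-expression networks for co-expressed gene modules lets biologists discover connections and interactions between genes that are usually hidden, helping them confirm new gene functions, find shared upstream regulators, or discover new disease biomarkers. At present there is no simple, easy-to-use online tool for mining gene co-expression modules that lets biologists unfamiliar with programming mine co-expression modules step by step from their own data and from public data, find local, small-scale co-expression modules, perform follow-up functional analysis on them, and allow modules to overlap (the commonly used co-expression module mining tool WGCNA does not allow overlap between modules, which conflicts with the fact that genes usually take part in multiple functions and signaling pathways). To meet this need, we built the TSUNAMI online tool package (Tool Suite for Network Analysis and Mining), which uses our previously developed co-expression mining algorithm lmQCM to analyze and mine gene co-expression networks and delivers the above functions through simple menu clicks. The features and advantages of the package are as follows: 1) a user-friendly interface that requires no knowledge of data mining or programming, so that the entire analysis can be completed with a few clicks; 2) direct connection to the NCBI GEO and TCGA databases for direct querying, data processing, and mining, as well as analysis of user-uploaded data; 3) a choice between two mining algorithms, lmQCM and WGCNA, left to the user; 4) automatic calculation and output of eigengenes for the mined modules (equivalent to module-level gene expression values), which facilitates subsequent differential analysis between groups and analysis against other sample properties; 5) integrated downstream gene functional analysis and upstream regulatory mechanism analysis; and 6) display of gene modules as Circos plots, with the results of each step available for download. The entire package is free for public use on the internet. Website: . Source code is available at .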

Page 1023-1031

Application Note

BrcaSeg: A Deep Learning Approach for Tissue Quantification and Genomic Correlations of Histopathological Images

Zixiao Lu, Xiaohui Zhan, Yi Wu, Jun Cheng, Wei Shao, Dong Ni, Zhi Han, Jie Zhang, Qianjin Feng, Kun Huang

Epithelial and stromal tissues are components of the tumor microenvironment and play a major role in tumor initiation and progression. Distinguishing stroma from epithelial tissues is critically important for spatial characterization of the tumor microenvironment. Here, we propose BrcaSeg, an image analysis pipeline based on a convolutional neural network (CNN) model to classify epithelial and stromal regions in whole-slide hematoxylin and eosin (H&E) stained histopathological images. The CNN model is trained using well-annotated breast cancer tissue microarrays and validated with images from The Cancer Genome Atlas (TCGA) Program. BrcaSeg achieves a classification accuracy of 91.02%, which outperforms other state-of-the-art methods. Using this model, we generate pixel-level epithelial/stromal tissue maps for 1000 TCGA breast cancer slide images that are paired with gene expression data. We subsequently estimate the epithelial and stromal ratios and perform correlation analysis to model the relationship between gene expression and tissue ratios. Gene Ontology (GO) enrichment analyses of genes that are highly correlated with tissue ratios suggest that the same tissue is associated with similar biological processes in different breast cancer subtypes, whereas each subtype also has its own idiosyncratic biological processes governing the development of these tissues. Taken all together, our approach can lead to new insights in exploring relationships between image-based phenotypes and their underlying genomic events and biological processes for all types of solid tumors. BrcaSeg can be accessed at

Page 1032-1042

Application Note

rePROBE: Workflow for Revised Probe Assignment and Updated Probe-set Annotation in Microarrays

Frieder Hadlich, Henry Reyer, Michael Oster, Nares Trakooljul, Eduard Muráni, Siriluck Ponsuksili, Klaus Wimmers

Commercial and customized microarrays are valuable tools for the analysis of holistic expression patterns, but require the integration of the latest genomic information. This study provides a comprehensive workflow implemented in an R package (rePROBE) to assign all probes and to annotate the probe sets based on up-to-date genomic and transcriptomic information. The rePROBE package can be applied to available gene expression microarray platforms and addresses both public and custom databases. The revised probe assignment and updated probe-set annotation are applied to commercial microarrays available for different livestock species, i.e., chicken (Gallus gallus; ChiGene-1_0-st: 443,579 probes and 18,530 probe sets), pig (Sus scrofa; PorGene-1_1-st: 592,005 probes and 25,779 probe sets), and cattle (Bos taurus; BovGene-1_0-st: 530,717 probes and 24,759 probe sets), as well as to those available for human (Homo sapiens; HuGene-1_0-st) and mouse (Mus musculus; HT_MG-430_PM). Using current species-specific transcriptomic information (RefSeq, Ensembl, and partially non-redundant nucleotide sequences) and genomic information, the applied workflow reveals 297,574 probes (15,689 probe sets) for chicken, 384,715 probes (21,673 probe sets) for pig, 363,077 probes (21,238 probe sets) for cattle, 481,168 probes (23,495 probe sets) for human, and 324,942 probes (32,494 probe sets) for mouse. These are representative of 12,641, 15,758, 18,046, 20,167, and 16,335 unique genes that are both annotated and positioned for chicken, pig, cattle, human, and mouse, respectively. Additionally, the workflow collects information on the number of single nucleotide polymorphisms (SNPs) within respective targeted genomic regions and thus provides a detailed basis for comprehensive analyses such as expression quantitative trait locus (eQTL) studies to identify quantitative and functional traits. The rePROBE R package is freely available at

Page 1043-1049