Articles Online (Volume 16, Issue 5)

Editorial

A Scientist Guerilla Fighter in the Frontiers of Bioinformatics—In Memory of Bailin Hao

Jun Yu

Page 307-309


Original Research

Polyphyly in 16S rRNA-based LVTree Versus Monophyly in Whole-genome-based CVTree

Guanghong Zuo, Ji Qi, Bailin Hao

We report an important but long-overlooked manifestation of the low resolving power of 16S rRNA sequence analysis at the species level: in 16S rRNA-based phylogenetic trees, polyphyletic placements of closely related species are abundant compared to those in genome-based phylogeny. This phenomenon makes the demarcation of genera within many families ambiguous in 16S rRNA-based taxonomy. In this study, we reconstructed the phylogenetic relationships of more than ten thousand prokaryote genomes using the CVTree method, which is based on whole-genome information. Many genera that are polyphyletic in 16S rRNA-based trees are well resolved as monophyletic clusters by CVTree. We believe that, as genome sequencing of prokaryotes becomes commonplace, genome-based phylogeny is destined to play a definitive role in the construction of a natural and objective taxonomy.
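CVTree is a whole-genome-based method developed by our group for studying the phylogeny and taxonomy of prokaryotes. It requires no sequence alignment and resolves taxa below the species level. In this paper, we used the CVTree web server to build a phylogenetic tree from prokaryote whole genomes numbering on the order of ten thousand. We systematically examined the phylogenetic relationships among the lower-rank taxa of ten families and one order, and compared them one by one with results from 16S rRNA-based trees. We found that, although the current taxonomy relies increasingly on 16S rRNA sequence analysis, the results of the whole-genome-based CVTree method agree better with the taxonomy, particularly at the species level. This indicates that CVTree makes sound use of the species information in whole-genome sequences and avoids the insufficient species-level resolution of 16S rRNA alignment-based methods, which stems from their limited information content. We believe that, as genome sequencing becomes ever more accessible, studies of prokaryotic phylogeny and taxonomy will need to take whole-genome results increasingly into account, and that CVTree will be a powerful tool to support prokaryote biologists in this work.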
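
The alignment-free idea behind a composition-vector comparison can be illustrated with a minimal sketch. It is a simplification and not the CVTree implementation: it compares raw k-string counts by cosine similarity and omits any background correction the published method applies. The function names, the choice of k, and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count overlapping k-strings in a sequence (no alignment needed)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition_distance(counts_a, counts_b):
    """Return (1 - C) / 2, where C is the cosine of the angle between
    the two count vectors. The value lies in [0, 0.5] for raw counts."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a sequence is shorter than k")
    return (1 - dot / (norm_a * norm_b)) / 2


def distance_matrix(seqs, k=3):
    """Pairwise distances between named sequences (hypothetical helper)."""
    vectors = {name: kmer_counts(seq, k) for name, seq in seqs.items()}
    names = sorted(vectors)
    return {(a, b): composition_distance(vectors[a], vectors[b])
            for a in names for b in names}
```

A distance matrix of this kind could then be passed to any standard tree-building procedure such as neighbor joining; that step is outside this sketch.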

Page 310-319


Method

VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder

Dongfang Wang, Jin Gu

Single-cell RNA sequencing (scRNA-seq) is a powerful technique for analyzing transcriptomic heterogeneity at the single-cell level. An effective low-dimensional representation and visualization of the original scRNA-seq data is an important step in studying cell sub-populations and lineages. At the single-cell level, transcriptional fluctuations are much larger than the average of a cell population, and the low amount of RNA transcripts increases the rate of technical dropout events. Therefore, scRNA-seq data are much noisier than traditional bulk RNA-seq data. In this study, we proposed the deep variational autoencoder for scRNA-seq data (VASC), a deep multi-layer generative model, for the unsupervised dimension reduction and visualization of scRNA-seq data. VASC can explicitly model the dropout events and find nonlinear hierarchical feature representations of the original data. Tested on over 20 datasets, VASC shows superior performance in most cases and exhibits broader dataset compatibility compared to four state-of-the-art dimension reduction and visualization methods. In addition, VASC provides better representations for very rare cell populations in the 2D visualization. As a case study, VASC successfully re-establishes the cell dynamics in pre-implantation embryos and identifies several candidate marker genes associated with early embryo development. Moreover, VASC also performs well on a 10× Genomics dataset with more cells and a higher dropout rate.
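In recent years, the rapid development of single-cell RNA sequencing (scRNA-seq) has enabled researchers to study the transcriptional heterogeneity of biological systems at the single-cell level, information that is usually hard to obtain from traditional omics data. At the single-cell level, however, random transcriptomic fluctuations far exceed the average behavior of a cell population, and the very low total RNA content of a single cell makes accurate measurement extremely challenging, so current single-cell sequencing data are very noisy. A major source of noise is dropout, in which many expressed mRNAs are not captured and their measured expression is therefore 0. An effective low-dimensional representation can reduce the noise in scRNA-seq data, allowing better analysis of cell types and states and visualization of the distribution of cells. In this study, we propose VASC, an scRNA-seq analysis method based on a deep variational autoencoder, for unsupervised dimension reduction and visualization of scRNA-seq data. VASC models dropout explicitly and uses a deep neural network to find complex nonlinear patterns in the data and reduce noise, yielding reliable dimension reduction and visualization. We tested the low-dimensional representations produced by VASC on more than 20 datasets, covering mainstream scRNA-seq technologies such as SMART-Seq, inDrop, and 10X. On most datasets, VASC extracted information on cell types or cell differentiation processes more effectively, demonstrating its broad applicability. VASC is freely available at https://github.com/wang-research/VASC.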
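
Two standard building blocks of any variational autoencoder are sketched here for orientation: the reparameterized sampling of the latent code and the KL regularization term. This is generic textbook material, not the VASC architecture. The encoder and decoder networks and the dropout model are omitted, and the function names are assumptions.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_var are per-dimension outputs of a (hypothetical) encoder.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I),
    summed over the latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


# Example: a 2D latent code, as would be used for a 2D visualization.
rng = random.Random(0)
z = sample_latent([0.3, -1.2], [0.0, -0.5], rng)
penalty = kl_to_standard_normal([0.3, -1.2], [0.0, -0.5])
```

In training, the KL term is added to a reconstruction loss; a model of the kind described above would additionally account for dropout zeros in that reconstruction step.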

Page 320-331


Method

TELS: A Novel Computational Framework for Identifying Motif Signatures of Transcribed Enhancers

Dimitrios Kleftogiannis, Haitham Ashoor, Vladimir B. Bajic

In mammalian cells, transcribed enhancers (TrEns) play important roles in the initiation of gene expression and maintenance of gene expression levels in a spatiotemporal manner. One of the most challenging questions is how the genomic characteristics of enhancers relate to enhancer activities. To date, only a limited number of enhancer sequence characteristics have been investigated, leaving room to explore the enhancers’ DNA code more systematically. To address this problem, we developed a novel computational framework, Transcribed Enhancer Landscape Search (TELS), aimed at identifying predictive cell type/tissue-specific motif signatures of TrEns. As a case study, we used TELS to compile a comprehensive catalog of motif signatures for all known TrEns identified by the FANTOM5 consortium across 112 human primary cells and tissues. Our results confirm that optimized combinations of different short motifs characterize cell type/tissue-specific TrEns. Our study is the first to report combinations of motifs that maximize classification performance in distinguishing TrEns exclusively transcribed in one cell type/tissue from TrEns exclusively transcribed in different cell types/tissues. Moreover, we also report 31 motif signatures predictive of enhancers’ broad activity. TELS codes and material are publicly available at http://www.cbrc.kaust.edu.sa/TELS.
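In mammalian cells, transcribed enhancers (TrEns) play important roles in initiating gene expression and maintaining gene expression levels. Because enhancers and promoters function similarly in transcriptional activation, it is difficult to separate the two cleanly. Deciphering the genomic characteristics of enhancers therefore helps to better understand enhancer function and the differences between enhancers and promoters. To date, studies of enhancer sequence characteristics have been scarce. To address this problem, we developed a machine learning-based method, Transcribed Enhancer Landscape Search (TELS), which identifies optimal combinations of short sequence motifs for transcribed enhancers in the human genome using logistic regression (LR) and a dimension reduction algorithm. In this study, we used as training samples the sequences of transcribed enhancers identified by CAGE experiments across 112 human primary cells and tissues. Using only motif sequence information from known transcribed enhancers, and no enhancer activity information, the method can distinguish cell-specific enhancers from non-specific ones. In addition, we found 31 motif signatures predictive of non-cell-specific enhancer activity. TELS code and data are publicly available at http://www.cbrc.kaust.edu.sa/TELS.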
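
A motif-based classifier of this kind needs each enhancer sequence turned into a numeric feature vector first. The sketch below shows one simple way to do that, by counting occurrences of short motifs on both strands. It is an assumption about the general approach, not the TELS feature definition, and the function names and motif list are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1)
               if seq[i:i + width] == motif)


def motif_features(seq, motifs):
    """One count per motif, summed over the forward and reverse strands.

    Note that a motif equal to its own reverse complement is counted
    twice per site under this simple scheme.
    """
    rc = reverse_complement(seq)
    return [count_overlapping(seq, m) + count_overlapping(rc, m)
            for m in motifs]
```

Vectors produced this way could be fed to a logistic regression model, with feature selection or dimension reduction used to search for a small, predictive motif combination.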

Page 332-341


Method

TICA: Transcriptional Interaction and Coregulation Analyzer

Stefano Perna, Pietro Pinoli, Stefano Ceri, Limsoon Wong

Transcriptional regulation is critical to cellular processes of all organisms. Regulatory mechanisms often involve more than one transcription factor (TF) from different families, binding together and attaching to the DNA as a single complex. However, only a fraction of the regulatory partners of each TF is currently known. In this paper, we present the Transcriptional Interaction and Coregulation Analyzer (TICA), a novel methodology for predicting heterotypic physical interaction of TFs. TICA employs a data-driven approach to infer interaction phenomena from chromatin immunoprecipitation and sequencing (ChIP-seq) data. Its prediction rules are based on the distribution of minimal distance couples of paired binding sites belonging to different TFs which are located closest to each other in promoter regions. Notably, TICA uses only binding site information from input ChIP-seq experiments, bypassing the need to do motif calling on sequencing data. We present our method and test it on ENCODE ChIP-seq datasets, using three cell lines as reference including HepG2, GM12878, and K562. TICA positive predictions on ENCODE ChIP-seq data are strongly enriched when compared to protein complex (CORUM) and functional interaction (BioGRID) databases. We also compare TICA against both motif/ChIP-seq based methods for physical TF–TF interaction prediction and published literature. Based on our results, TICA offers significant specificity (average 0.902) while maintaining a good recall (average 0.284) with respect to CORUM, providing a novel technique for fast analysis of regulatory effect in cell lines. Furthermore, predictions by TICA are complementary to other methods for TF–TF interaction prediction (in particular, TACO and CENTDIST). Thus, combined application of these prediction tools results in much improved sensitivity in detecting TF–TF interactions compared to TICA alone (sensitivity of 0.526 when combining TICA with TACO and 0.585 when combining with CENTDIST) with little compromise in specificity (specificity 0.760 when combining with TACO and 0.643 with CENTDIST). TICA is publicly available at http://geco.deib.polimi.it/tica/.
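Transcriptional regulation is a vital part of cellular activity in living organisms, and transcription factors (TFs) are proteins that take part in the initiation and regulation of gene transcription. Regulation is usually carried out not by a single TF but by several TFs from different families, which bind together into a complex that attaches to the DNA and thereby affects transcription. However, only a small fraction of TF interactions is currently known. This paper therefore presents a new method for predicting physical interactions between TFs, the Transcriptional Interaction and Coregulation Analyzer (TICA). TICA infers TF interactions from chromatin immunoprecipitation and sequencing (ChIP-seq) data. These interactions include direct binding between TFs, interactions between TFs in the same complex that are not in direct contact, and interactions that, once formed, prevent other TFs from binding to the partner. The prediction principle is to search promoter regions for TF pairs whose neighboring binding sites lie at genomic distances significantly smaller than those of random TF combinations. Notably, TICA takes only data from ChIP-seq experiments as input and requires no motif prediction. In this study, we tested the software on ENCODE ChIP-seq datasets from three cell lines, HepG2, GM12878, and K562, and compared the predictions with data in the protein complex (CORUM) and functional interaction (BioGRID) databases; the comparison indicates that TICA predictions are highly accurate. We also compared TICA with other methods for predicting TF interactions. The results show that TICA maintains good agreement with CORUM (average 0.284) while achieving high specificity (average 0.902), so TICA can serve as a new method for fast analysis of regulatory networks in cell lines or as a complement to other software for predicting TF interactions. TICA is publicly available at http://geco.deib.polimi.it/tica/.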
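
The distance-based idea can be illustrated with a minimal sketch that, for each binding site of one TF, finds the distance to the closest binding site of another TF on the same chromosome. It is a simplification, not the TICA procedure: restriction to promoter regions and the statistical comparison against random TF pairs are left out, and the function name and coordinates are hypothetical.

```python
from bisect import bisect_left
from statistics import median


def nearest_distances(sites_a, sites_b):
    """For each position in sites_a, return the distance to the closest
    position in sites_b. Positions are integer coordinates on one
    chromosome, for example ChIP-seq peak summits."""
    ordered_b = sorted(sites_b)
    if not ordered_b:
        raise ValueError("sites_b must contain at least one position")
    distances = []
    for pos in sites_a:
        i = bisect_left(ordered_b, pos)
        candidates = []
        if i < len(ordered_b):          # closest site at or to the right
            candidates.append(ordered_b[i] - pos)
        if i > 0:                       # closest site to the left
            candidates.append(pos - ordered_b[i - 1])
        distances.append(min(candidates))
    return distances


# Toy coordinates for two TFs on the same chromosome.
tf_a = [100, 500, 900]
tf_b = [120, 450, 2000]
summary = median(nearest_distances(tf_a, tf_b))
```

A summary statistic such as this median could then be compared with the same statistic for randomly chosen TF pairs to judge whether two TFs bind unusually close together.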

Page 342-353


Method

Machine Learning Models for Genetic Risk Assessment of Infants with Non-syndromic Orofacial Cleft

Shi-Jian Zhang, Peiqi Meng, Jieni Zhang, Peizeng Jia, Jiuxiang Lin, Xiangfeng Wang, Feng Chen, Xiaoxing Wei

The isolated type of orofacial cleft, termed non-syndromic cleft lip with or without cleft palate (NSCL/P), is the second most common birth defect in China, with Asians having the highest incidence in the world. NSCL/P involves multiple genes and complex interactions between genetic and environmental factors, imposing difficulty for the genetic assessment of the unborn fetus carrying multiple NSCL/P-susceptible variants. Although genome-wide association studies (GWAS) have uncovered dozens of single nucleotide polymorphism (SNP) loci in different ethnic populations, the genetic diagnostic effectiveness of these SNPs requires further experimental validation in Chinese populations before a diagnostic panel or a predictive model covering multiple SNPs can be built. In this study, we collected blood samples from control and NSCL/P infants in Han and Uyghur Chinese populations to validate the diagnostic effectiveness of 43 candidate SNPs previously detected using GWAS. We then built predictive models with the validated SNPs using different machine learning algorithms and evaluated their prediction performance. Our results showed that logistic regression had the best performance for risk assessment according to the area under curve. Notably, defective variants in MTHFR and RBP4, two genes involved in folic acid and vitamin A biosynthesis, were found to have high contributions to NSCL/P incidence based on feature importance evaluation with logistic regression. This is consistent with the notion that folic acid and vitamin A are both essential nutritional supplements for pregnant women to reduce the risk of conceiving an NSCL/P baby. Moreover, we observed a lower predictive power in Uyghur than in Han cases, likely due to differences in genetic background between these two ethnic populations. Thus, our study highlights the urgency to generate the HapMap for Uyghur population and perform resequencing-based screening of Uyghur-specific NSCL/P markers.
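Cleft lip and palate is one of the most common birth defects of the oral and maxillofacial region. Its incidence varies among ethnic groups and is highest in East Asians (1/500). In addition, the incidence among Uyghurs, the second-largest ethnic minority in China (1.96/1000), is higher than the Chinese average (1.42/1000). The etiology is complex, involving both genetic factors (such as variants in IRF6) and environmental factors (such as maternal nutritional status, smoking, and alcohol), which makes genetic risk assessment difficult. To date, several genome-wide association studies (GWAS) have identified single nucleotide polymorphism (SNP) loci associated with cleft lip and palate, but the genetic contribution of each individual locus remains unclear. In this study, we collected 103 Han cases, 279 Uyghur cases, 504 Han controls, and 205 Uyghur controls. From six GWAS papers on cleft lip and palate published in high-profile journals up to December 2017, we selected 43 associated SNP loci, genotyped every participant at these 43 loci, built risk prediction models separately for the Uyghur and Han populations using different machine learning algorithms, and compared the predictive power of the algorithms. Among seven algorithms, logistic regression performed best: in the Han population the area under the receiver operating characteristic curve (AUC) reached 0.90, whereas in the Uyghur population predictive power was lower, with an AUC of only 0.64. By stepwise addition and removal of the 43 SNP loci during model building, we further selected 6 loci; a model built on these 6 loci also predicted risk well in the Han population, with an AUC of 0.87. Four of the 6 SNPs are related to nutrient metabolism, including rs1801133 and rs1801131 in the coding region of the folate metabolism gene MTHFR and rs10882272 in the non-coding region of RBP4, which encodes a vitamin A transport protein. Thus, a machine learning model built on a small number of SNP loci can predict cleft lip and palate risk well in the Han population. The model may have clinical potential but requires further validation in more populations. In addition, variants in nutrient metabolism genes may play an important role in the development of cleft lip and palate. We speculate that, for women preparing for pregnancy or in early pregnancy, especially carriers of the corresponding defective genes, individualized supplementation with folic acid and/or vitamin A at doses matched to their genotype may reduce the risk of cleft lip and palate in the fetus.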
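
The evaluation described above rests on two ingredients that are easy to state in code: a logistic model that turns SNP genotypes into a risk probability, and the AUC used to compare models. The sketch below is illustrative only. Genotypes are assumed to be coded as risk-allele counts (0, 1, or 2), and the weights, intercept, and function names are hypothetical rather than taken from the study.

```python
import math


def risk_probability(genotypes, weights, intercept):
    """Logistic model: sigmoid(intercept + sum of weight * allele count)."""
    linear = intercept + sum(w * g for w, g in zip(weights, genotypes))
    return 1.0 / (1.0 + math.exp(-linear))


def roc_auc(labels, scores):
    """AUC as the probability that a random case scores higher than a
    random control, counting ties as one half."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy example with made-up weights for three SNPs.
weights = [0.8, -0.3, 0.5]
people = [[2, 0, 1], [0, 1, 0], [1, 0, 2], [0, 2, 0]]
labels = [1, 0, 1, 0]
scores = [risk_probability(g, weights, -1.0) for g in people]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls, which is the scale on which the values 0.90, 0.87, and 0.64 above should be read.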

Page 354-364


Application Note

GITAR: An Open Source Tool for Analysis and Visualization of Hi-C Data

Riccardo Calandrelli, Qiuyang Wu, Jihong Guan, Sheng Zhong

Interactions between chromatin segments play a large role in functional genomic assays, and developments in genomic interaction detection methods have revealed interacting topological domains within the genome. Among these methods, Hi-C plays a key role. Here, we present the Genome Interaction Tools and Resources (GITAR), a software package for comprehensive Hi-C data analysis, including data preprocessing, normalization, and visualization, as well as analysis of topologically-associated domains (TADs). GITAR is composed of two main modules: (1) HiCtool, a Python library to process and visualize Hi-C data, including TAD analysis; and (2) processed data library, a large collection of human and mouse datasets processed using HiCtool. HiCtool leads the user step-by-step through a pipeline, which goes from the raw Hi-C data to the computation, visualization, and optimized storage of intra-chromosomal contact matrices and TAD coordinates. A large collection of standardized processed data allows the users to compare different datasets in a consistent way, while saving the time needed to obtain data for visualization or additional analyses. More importantly, GITAR enables users without any programming or bioinformatic expertise to work with Hi-C data. GITAR is publicly available at http://genomegitar.org as open-source software.
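Interactions between chromatin segments play an important role in functional genomic assays, and advances in methods for detecting genomic interactions have revealed interacting topological domains within the genome. Among these methods, Hi-C plays a key role. Here we introduce the Genome Interaction Tools and Resources (GITAR), software for comprehensive Hi-C data analysis, including data preprocessing, normalization, and visualization, as well as analysis of topologically associated domains (TADs). GITAR consists of two main modules: (1) HiCtool, a Python library for processing and visualizing Hi-C data, including TAD analysis; and (2) a processed Hi-C data library, a large collection of human and mouse datasets processed with HiCtool. HiCtool guides the user step by step through a pipeline that runs from raw Hi-C data to the computation, visualization, and optimized storage of intra-chromosomal contact matrices and TAD coordinates. The large collection of standardized processed data lets users compare datasets consistently while saving the time needed to obtain data for visualization or further analysis. More importantly, GITAR enables users without any programming or bioinformatics expertise to work with Hi-C data. GITAR is publicly released as open-source software at http://genomegitar.org.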
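
An intra-chromosomal contact matrix, the central data structure mentioned above, can be sketched in a few lines. The code below does not use the HiCtool API. It is a generic illustration that bins read-pair positions at a fixed resolution, with 0-based coordinates and all names being assumptions.

```python
def contact_matrix(pairs, chrom_length, bin_size):
    """Build a symmetric matrix of contact counts for one chromosome.

    pairs: iterable of (pos1, pos2) 0-based positions of the two ends of
    each read pair, both on the same chromosome.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


# Toy example: a 30 bp "chromosome" at 10 bp resolution.
toy = contact_matrix([(5, 25), (12, 14), (29, 3)], chrom_length=30, bin_size=10)
```

Real pipelines normalize such raw counts before visualization or domain calling, because coverage differs between bins for technical reasons.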

Page 365-372


Application Note

RGAAT: A Reference-based Genome Assembly and Annotation Tool for New Genomes and Upgrade of Known Genomes

Wanfei Liu, Shuangyang Wu, Qiang Lin, Shenghan Gao, Feng Ding, Xiaowei Zhang, Hasan Awad Aljohi, Jun Yu, Songnian Hu

The rapid development of high-throughput sequencing technologies has led to a dramatic decrease in the money and time required for de novo genome sequencing or genome resequencing projects, with new genome sequences released every week. Among such projects, the plethora of updated genome assemblies creates a need for version-dependent annotation files and other compatible public datasets for downstream analysis. To handle these tasks in an efficient manner, we developed the reference-based genome assembly and annotation tool (RGAAT), a flexible toolkit for resequencing-based consensus building and annotation update. RGAAT can detect sequence variants with comparable precision, specificity, and sensitivity to GATK and with higher precision and specificity than Freebayes and SAMtools on four DNA-seq datasets tested in this study. RGAAT can also identify sequence variants based on cross-cultivar or cross-version genomic alignments. Unlike GATK and SAMtools/BCFtools, RGAAT builds the consensus sequence by taking into account the true allele frequency. Finally, RGAAT generates a coordinate conversion file between the reference and query genomes using sequence variants and supports annotation file transfer. Compared to the rapid annotation transfer tool (RATT), RGAAT displays better performance characteristics for annotation transfer between different genome assemblies, strains, and species. In addition, RGAAT can be used for genome modification, genome comparison, and coordinate conversion. RGAAT is available at https://sourceforge.net/projects/rgaat/ and https://github.com/wushyer/RGAAT_v2 at no cost.
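While processing resequencing data from related rice cultivars, researchers found large differences among cultivars in transcriptome mapping rates and in the number of reconstructed transcripts, indicating that a single reference genome cannot serve all cultivars. This study therefore developed RGAAT (Reference-based Genome Assembly and Annotation Tool), a tool for resequencing-based consensus sequence building, variant identification, and annotation transfer. RGAAT takes a genome sequence and annotation files (GTF, GFF, GFF3, and BED) together with mapping files (SAM/BAM) or variant files (VCF), and produces an updated genome sequence and transferred annotation files. Unlike GATK and SAMtools/BCFtools, RGAAT takes the true allele frequency into account when building the consensus sequence. RGAAT can also identify genomic variants. On four resequencing test datasets, the precision and specificity of RGAAT variant detection were comparable to GATK and higher than Freebayes and SAMtools. RGAAT can also identify variants from genomic alignments between cultivars. From the sequence variant file, RGAAT computes a coordinate conversion file between the input sequence and the reference genome and an annotation transfer file. Test results show that RGAAT transfers annotations better than the existing annotation transfer tool RATT. RGAAT is hosted on SourceForge (https://sourceforge.net/projects/rgaat/) and GitHub (https://github.com/wushyer/RGAAT_v2) and is free to download and use.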
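
One plausible reading of frequency-aware consensus building is to pick, at each position, the allele observed most often in the mapped reads. The sketch below implements that reading for substitutions only. It is an assumption for illustration and may differ from the rule RGAAT actually applies; insertions, deletions, and quality filtering are ignored, and all names are hypothetical.

```python
def consensus_base(counts, ref_base, min_depth=1):
    """Pick the most frequent observed base at one position.

    counts: mapping from base to number of supporting reads.
    Falls back to the reference base when coverage is below min_depth,
    and prefers the reference base when it ties for the highest count.
    """
    depth = sum(counts.values())
    if depth < min_depth:
        return ref_base
    best = max(counts.values())
    winners = sorted(base for base, n in counts.items() if n == best)
    return ref_base if ref_base in winners else winners[0]


def build_consensus(reference, pileup, min_depth=1):
    """Apply consensus_base along a reference string.

    pileup: mapping from 0-based position to per-base read counts;
    positions without an entry keep the reference base.
    """
    return "".join(
        consensus_base(pileup.get(i, {}), base, min_depth)
        for i, base in enumerate(reference)
    )


# Position 1 is dominated by T; position 3 is a tie that keeps the reference.
updated = build_consensus("ACGT", {1: {"C": 2, "T": 8}, 3: {"T": 5, "G": 5}})
```

Because only substitutions are handled here, coordinates are unchanged between reference and consensus; once insertions and deletions are applied, a coordinate conversion table of the kind described above becomes necessary.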

Page 373-381


Application Note

AncestryPainter: A Graphic Program for Displaying Ancestry Composition of Populations and Individuals

Qidi Feng, Dongsheng Lu, Shuhua Xu

Ancestry composition of populations and individuals has been extensively investigated in recent years due to advances in genotyping and sequencing technologies. As the number of populations and individuals used for ancestry inference increases remarkably, for example to more than 100 populations or 1000 individuals, it is usually challenging to present the ancestry composition in the traditional way using a rectangular graph. To address this issue, we developed a program, AncestryPainter, which illustrates the ancestry composition of populations and individuals with a compact circular graph to save space. Individuals are depicted as length-fixed bars partitioned into colored segments representing different ancestries, and the population of interest can be highlighted as a pie chart in the center of the circle plot. In addition, AncestryPainter can also be applied to display personal ancestry in a way similar to that for displaying population ancestry. AncestryPainter is publicly available at http://www.picb.ac.cn/PGG/resource.php.
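In recent years, with the rapid development of DNA sequencing technologies and the continuing fall in sequencing costs, large-scale analysis of the genetic background and ancestry composition of populations and individuals has become feasible and widely practiced. However, as the number of populations and the sample sizes keep growing, presenting the results of ancestry analysis, especially as figures intended for formal publication, has become a challenge. To display the ancestry of hundreds, thousands, or even tens of thousands of populations and individual genomes comprehensively and clearly within a single view (a printed page), we developed a new graphical display program, AncestryPainter. Each individual is drawn as a shape composed of different colors according to ancestry composition, all individuals are packed tightly into a ring, and the population of particular interest is shown as a pie chart in the center of the ring, which saves as much space as possible while presenting the results more completely and clearly. This form of display can effectively show the ancestry composition of more populations and makes comparing ancestry composition between populations easier. The plotting software can be freely downloaded at http://www.picb.ac.cn/PGG/resource.php.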

Page 382-385