Articles Online (Volume 17, Issue 5)

Editorial

In Memory of Vladimir B. Bajic (1952–2019)

Zhang Zhang, Jun Yu, Frank Eisenhaber, Xin Gao, Takashi Gojobori

Page 473-474


Research Highlight

Deep Learning Deciphers Protein–RNA Interaction

Ming Li

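Protein–RNA interactions are ubiquitous in cells and are a major mechanism of post-transcriptional regulation. By controlling gene regulation at multiple levels, RNA-binding proteins (RBPs) determine not only which transcripts are translated but also the speed, location, and concentration of mRNA translation. Base-dominated and backbone-dominated interactions are the two main modes by which RBPs interact with RNA. There are two main ways to characterize protein–RNA interactions: experimental techniques and computational methods. Experimental methods mainly comprise approaches based on high-throughput sequencing and approaches from structural biology, both of which have clear drawbacks. Computational methods, for their part, are still immature. Most existing deep learning-based methods make binary predictions, that is, they predict only whether a subunit of an RBP is a binding domain. Such methods usually have high false-positive rates. NucleicNet, developed by the Gao Lab, is a deep learning tool that predicts protein–RNA interactions at the RNA level. It frames the protein–RNA interaction problem as a seven-class classification problem, with a label set comprising non-site, ribose, phosphate, and the four different bases. For any deep learning method, the training data are the most critical component. Lam et al. used a dataset containing all solved protein–RNA complex structures in the Protein Data Bank (PDB) and carefully removed redundant structures and redundant chains, yielding a dataset of 175 RNA–protein complexes. For known RBPs, NucleicNet can be used to score any given RNA binding sequence, design optimal binding sequences, and draw sequence logos. For proteins with unknown RNA-binding function, NucleicNet can be used to check whether the protein has suitable RNA-binding sites and, if so, to detect the preferred RNA-binding regions. The developers also provide a website that makes NucleicNet convenient for public use.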

Page 475-477


Original Research

DeepCPI: A Deep Learning-based Framework for Large-scale in silico Drug Screening

Fangping Wan, Yue Zhu, Hailin Hu, Antao Dai, Xiaoqing Cai, Ligong Chen, Haipeng Gong, Tian Xia, Dehua Yang, Ming-Wei Wang, Jianyang Zeng

Accurate identification of compound–protein interactions (CPIs) in silico may deepen our understanding of the underlying mechanisms of drug action and thus remarkably facilitate drug discovery and development. Conventional similarity- or docking-based computational methods for predicting CPIs rarely exploit latent features from currently available large-scale unlabeled compound and protein data and often limit their usage to relatively small-scale datasets. In the present study, we propose DeepCPI, a novel general and scalable computational framework that combines effective feature embedding (a technique of representation learning) with powerful deep learning methods to accurately predict CPIs at a large scale. DeepCPI automatically learns the implicit yet expressive low-dimensional features of compounds and proteins from a massive amount of unlabeled data. Evaluations of the measured CPIs in large-scale databases, such as ChEMBL and BindingDB, as well as of the known drug–target interactions from DrugBank, demonstrated the superior predictive performance of DeepCPI. Furthermore, several interactions among small-molecule compounds and three G protein-coupled receptor targets (glucagon-like peptide-1 receptor, glucagon receptor, and vasoactive intestinal peptide receptor) predicted using DeepCPI were experimentally validated. The present study suggests that DeepCPI is a useful and powerful tool for drug discovery and repositioning. The source code of DeepCPI can be downloaded from https://github.com/FangpingWan/DeepCPI.
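Accurate computational identification of compound–protein interactions (CPIs) can deepen our understanding of the mechanisms of drug action and thereby facilitate drug discovery and development. Traditional similarity- or docking-based CPI prediction methods typically draw on small-scale labeled datasets and rarely exploit the latent feature information in large amounts of unlabeled compound and protein data. In this study, we propose DeepCPI, a novel CPI prediction model. DeepCPI uses feature embedding (a representation learning method) to automatically learn feature representations of compounds and proteins from large-scale unlabeled data, and then combines them with deep learning to predict CPIs accurately and at scale. Computational experiments and evaluations on large-scale CPI datasets (ChEMBL, BindingDB, and DrugBank) show that DeepCPI has superior predictive performance. In addition, we used wet-lab experiments to validate interactions predicted by DeepCPI between several small molecules and three GPCR proteins (GLP-1R, GCGR, and VIPR). Taken together, these results show that DeepCPI is an effective tool for drug discovery and repositioning.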

Page 478-495


Method

I3: A Self-organising Learning Workflow for Intuitive Integrative Interpretation of Complex Genetic Data

Yun Tan, Lulu Jiang, Kankan Wang, Hai Fang

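We propose a computational workflow (I3) for intuitive integrative interpretation of complex genetic data, building mainly on the self-organising principle. We illustrate its use in interpreting the genetics of gene expression and in understanding genetic regulators of protein phenotypes, particularly in conjunction with information from human population genetics and/or the evolutionary history of human genes. We reveal that loss-of-function intolerant genes tend to be depleted of tissue-sharing genetics of gene expression in brains and, if highly expressed, have broad effects on the protein phenotypes studied. We suggest that this workflow presents a general solution to the challenge of complex genetic data interpretation. I3 is available at http://suprahex.r-forge.r-project.org/I3.html.
We propose a computational workflow (I3), based on the self-organising principle, for intuitive integrative interpretation of complex genetic data. Applying this workflow, particularly in combination with information from human population genetics and the evolutionary history of human genes, we interpret the genetic regulation of gene expression and of protein phenotypes. We find that loss-of-function intolerant genes in the brain are not subject to genetic regulation of gene expression and that, if highly expressed, they have broad effects on the protein phenotypes studied. We suggest that this workflow offers a general solution to the challenge of interpreting complex genetic data. I3 is available at http://suprahex.r-forge.r-project.org/I3.html.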

Page 503-510


Application Note

CIRCexplorer3: A CLEAR Pipeline for Direct Comparison of Circular and Linear RNA Expression

Xu-Kai Ma, Meng-Ran Wang, Chu-Xiao Liu, Rui Dong, Gordon G. Carmichael, Ling-Ling Chen, Li Yang

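Sequences of circular RNAs (circRNAs) produced from back-splicing of exon(s) completely overlap with those from cognate linear RNAs transcribed from the same gene loci with the exception of their back-splicing junction (BSJ) sites. Therefore, examination of global circRNA expression from RNA-seq datasets generally relies on the detection of RNA-seq fragments spanning BSJ sites, which is different from the quantification of linear RNA expression by normalized RNA-seq fragments mapped to whole gene bodies. Thus, direct comparison of circular and linear RNA expression from the same gene loci in a genome-wide manner has remained challenging. Here, we update the previously reported CIRCexplorer pipeline to version 3 for circular and linear RNA expression analysis from ribosomal-RNA depleted RNA-seq (CIRCexplorer3-CLEAR). A new quantitation parameter, fragments per billion mapped bases (FPB), is applied to evaluate circular and linear RNA expression individually by fragments mapped to circRNA-specific BSJ sites or to linear RNA-specific splicing junction (SJ) sites. Comparison of circular and linear RNA expression levels is directly achieved by dividing FPBcirc by FPBlinear to generate a CIRCscore, which indicates the relative circRNA expression level using linear RNA expression level as the background. Highly expressed circRNAs with low cognate linear RNA expression background can be readily identified by CIRCexplorer3-CLEAR for further investigation. CIRCexplorer3-CLEAR is publicly available at https://github.com/YangLab/CLEAR.
Apart from the back-splicing junction sites, circular RNAs and their cognate linear RNAs overlap completely in primary sequence, so the strategies for systematically detecting circular and linear RNAs from transcriptome sequencing data differ. Current transcriptome-wide detection and quantification of circular RNAs relies mainly on the analysis of reads spanning back-splicing junction sites, whereas linear RNA quantification relies on reads that cover gene exons and span exon junctions. Because of this mismatch, most existing computational methods cannot be used directly to compare the expression of a circular RNA with that of its cognate linear RNA, which is one of the difficulties of transcriptome-wide circular RNA analysis. In this latest study, we upgraded and developed CIRCexplorer3-CLEAR (circular and linear RNA expression analysis from ribosomal-RNA depleted RNA-seq, https://github.com/YangLab/CLEAR), an analysis system for direct quantitative comparison of circular RNAs and their cognate linear RNAs, to enable accurate quantification of circular RNAs. CIRCexplorer3-CLEAR uses a unified quantitation parameter, FPB (fragments per billion mapped bases), and calculates circular RNA expression (FPBcirc) and linear RNA expression (FPBlinear) from reads spanning back-splicing junction sites and canonical splicing junction sites, respectively. It then directly compares the FPB values of a circular RNA and its cognate linear RNA to obtain a CIRCscore (FPBcirc vs FPBlinear), which represents the expression level of the circular RNA relative to the linear RNA. Compared with the earlier FPM, the newly established FPB is unaffected by read length and sequencing strategy, and CIRCscore further removes the linear RNA expression background, so quantitative analysis based on FPB and CIRCscore should be more widely applicable and more robust. Using CIRCexplorer3-CLEAR, researchers can screen for circular RNAs that are highly expressed relative to their linear RNAs and go on to study circular RNA function and mechanisms of action.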
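
The FPB and CIRCscore quantities described above can be illustrated with a short calculation. This is a minimal sketch that assumes FPB is the junction fragment count scaled to one billion mapped bases, as the name suggests; the function names `fpb` and `circ_score` are hypothetical and are not part of the CLEAR command line or API, and how the pipeline selects the linear SJ count for a given circRNA is not specified here.

```python
def fpb(junction_fragments: int, total_mapped_bases: int) -> float:
    """Fragments per billion mapped bases for one junction (assumed definition)."""
    if total_mapped_bases <= 0:
        raise ValueError("total_mapped_bases must be positive")
    # Multiply first so that round inputs give round floating-point results.
    return junction_fragments * 1e9 / total_mapped_bases


def circ_score(bsj_fragments: int, sj_fragments: int, total_mapped_bases: int) -> float:
    """CIRCscore = FPBcirc / FPBlinear for a circRNA and its cognate linear RNA."""
    fpb_circ = fpb(bsj_fragments, total_mapped_bases)
    fpb_linear = fpb(sj_fragments, total_mapped_bases)
    if fpb_linear == 0:
        # No linear background was observed, so the ratio is undefined here.
        raise ValueError("no linear splicing-junction fragments; CIRCscore undefined")
    return fpb_circ / fpb_linear
```

For example, in a library with 5 billion mapped bases, 50 BSJ fragments correspond to an FPBcirc of 10 and 25 SJ fragments to an FPBlinear of 5, giving a CIRCscore of 2. Because both values are normalized by the same library, the ratio reduces to the ratio of fragment counts within a sample; the FPB values themselves are what allow comparison across samples.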

Page 511-521


Application Note

CircAST: Full-length Assembly and Quantification of Alternatively Spliced Isoforms in Circular RNAs

Jing Wu, Yan Li, Cheng Wang, Yiqiang Cui, Tianyi Xu, Chang Wang, Xiao Wang, Jiahao Sha, Bin Jiang, Kai Wang, Zhibin Hu, Xuejiang Guo, Xiaofeng Song

Circular RNAs (circRNAs), covalently closed continuous RNA loops, are generated from cognate linear RNAs through back splicing events, and alternative splicing events may generate different circRNA isoforms at the same locus. However, the challenges of reconstruction and quantification of alternatively spliced full-length circRNAs remain unresolved. On the basis of the internal structural characteristics of circRNAs, we developed CircAST, a tool to assemble alternatively spliced circRNA transcripts and estimate their expression by using multiple splice graphs. Simulation studies showed that CircAST correctly assembled the full sequences of circRNAs with a sensitivity of 85.63%–94.32% and a precision of 81.96%–87.55%. By assigning reads to specific isoforms, CircAST quantified the expression of circRNA isoforms with correlation coefficients of 0.85–0.99 between theoretical and estimated values. We evaluated CircAST on an in-house mouse testis RNA-seq dataset with RNase R treatment for enriching circRNAs and identified 380 circRNAs with full-length sequences different from those of their corresponding cognate linear RNAs. RT-PCR and Sanger sequencing analyses validated 32 out of 37 randomly selected isoforms, thus further indicating the good performance of CircAST, especially for isoforms with low abundance. We also applied CircAST to published experimental data and observed substantial diversity in circular transcripts across samples, thus suggesting that circRNA expression is highly regulated. CircAST can be accessed freely at https://github.com/xiaofengsong/CircAST.
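Circular RNAs (circRNAs) have a covalently closed circular structure and are generated from cognate linear RNAs through back-splicing events; internal alternative splicing events can give rise to different circular transcripts. However, the assembly and quantification of full-length circRNA transcripts remain unresolved. Based on the characteristics of the internal structure of circRNAs, we developed CircAST, a tool that reconstructs circRNA transcripts and estimates their expression by building a multiple splice graph model. Simulations show that CircAST reconstructs full-length circRNA transcript sequences with a sensitivity of 85.63%–94.32% and a precision of 81.96%–87.55%. For transcript quantification, CircAST assigns reads to specific transcripts to calculate each transcript's expression abundance, achieving correlation coefficients of 0.85–0.99 between estimated and true values. We also applied CircAST to RNA-seq data from RNase R-treated mouse testis and found 380 circRNAs whose full-length transcripts differ in sequence from those of their cognate linear RNA transcripts. Finally, we used RT-PCR and Sanger sequencing to validate 37 randomly selected circular transcripts, of which 32 were successfully validated, indicating that CircAST performs well, particularly for low-abundance circRNA transcripts. We further applied CircAST to published experimental data; the results show substantial diversity in the circRNA transcripts expressed across samples, suggesting that their expression is highly regulated. CircAST can be downloaded and used freely at https://github.com/xiaofengsong/CircAST.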

Page 522-534


Application Note

shinyChromosome: An R/Shiny Application for Interactive Creation of Non-circular Plots of Whole Genomes

Yiming Yu, Wen Yao, Yuping Wang, Fangfang Huang

Non-circular plots of whole genomes are natural representations of genomic data aligned along all chromosomes. Currently, there is no specialized graphical user interface (GUI) designed to produce non-circular whole genome diagrams, and the use of existing tools requires considerable coding effort from users. Moreover, such tools also require improvement, including the addition of new functionalities. To address these issues, we developed a new R/Shiny application, named shinyChromosome, as a GUI for the interactive creation of non-circular whole genome diagrams. shinyChromosome can be easily installed on personal computers for personal use as well as on local or public servers for community use. Publication-quality images can be readily generated and annotated from user input using diverse widgets. shinyChromosome is deployed at http://150.109.59.144:3838/shinyChromosome/, http://shinyChromosome.ncpgr.cn, and https://yimingyu.shinyapps.io/shinyChromosome for online use. The source code and manual of shinyChromosome are freely available at https://github.com/venyao/shinyChromosome.
Non-circular whole-genome plots are a common way of displaying genomic data along all chromosomes. To date, there is no graphical user interface (GUI) program dedicated to creating non-circular whole-genome plots, and using existing non-GUI tools requires a great deal of programming from users. Existing tools also need further improvement, including the addition of new functions. To address these issues, we used the Shiny package for R to develop a GUI program, named shinyChromosome, for the interactive creation of non-circular whole-genome plots. shinyChromosome can easily be installed on a personal computer for the user's own use, or on a local or public server for use by many others. Using the many widgets provided in the shinyChromosome interface, users can easily generate publication-quality figures and annotate them. shinyChromosome is deployed at three addresses for online use: http://150.109.59.144:3838/shinyChromosome/, http://shinyChromosome.ncpgr.cn, and https://yimingyu.shinyapps.io/shinyChromosome. The source code and manual of shinyChromosome are freely available at https://github.com/venyao/shinyChromosome.

Page 535-539


Application Note

VPOT: A Customizable Variant Prioritization Ordering Tool for Annotated Variants

Eddie Ip, Gavin Chapman, David Winlaw, Sally L. Dunwoodie, Eleni Giannoulatou

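Next-generation sequencing (NGS) technologies generate thousands to millions of genetic variants per sample. Identification of potential disease-causal variants is labor-intensive as it relies on filtering using various annotation metrics and consideration of multiple pathogenicity prediction scores. We have developed VPOT (variant prioritization ordering tool), a Python-based command line tool that allows researchers to create a single fully customizable pathogenicity ranking score from any number of annotation values, each with a user-defined weighting. The use of VPOT can be informative when analyzing entire cohorts, as variants in a cohort can be prioritized. VPOT also provides additional functions to allow variant filtering based on a candidate gene list or by affected status in a family pedigree. VPOT outperforms similar tools in terms of efficacy, flexibility, scalability, and computational performance. VPOT is freely available for public use at GitHub (https://github.com/VCCRI/VPOT/). Documentation for installation along with a user tutorial, a default parameter file, and test data are provided.
With the increasingly widespread use of next-generation sequencing (NGS) technologies, researchers now face genetic variant data numbering from hundreds of thousands to millions. Many pathogenicity prediction algorithms exist, such as CADD, PolyPhen-2, SIFT, and MutationTaster2, but no single algorithm is generally accepted as the best. Researchers therefore usually run several prediction methods and regard variants predicted jointly by these tools as more worthwhile for pathogenicity research. This makes identifying potential disease variants extremely laborious. We therefore developed VPOT (variant prioritization ordering tool), a Python-based tool that lets researchers create their own pathogenicity ranking score from any number of annotation values, each with a user-defined weight. VPOT can be more informative when analyzing multiple cohorts, because users can choose to prioritize the variants in any one cohort. VPOT also provides additional functionality for filtering variants based on a candidate gene list or on affected status in a family pedigree. Compared with similar tools, VPOT has advantages in efficacy, flexibility, scalability, and computational performance. VPOT is freely available from GitHub (https://github.com/VCCRI/VPOT/). Installation instructions, a user tutorial, a default parameter file, and test data are also provided.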
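
The idea of combining several annotation values into one ranking score with user-defined weights can be sketched as a plain weighted sum. This is an illustrative assumption only: it does not reproduce VPOT's parameter file format, its scoring rules, or its command line, and the names `priority_score`, `rank_variants`, and the annotation keys are hypothetical.

```python
from typing import Dict, List


def priority_score(annotations: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of annotation values; annotations missing for a variant count as 0."""
    return sum(w * annotations.get(name, 0.0) for name, w in weights.items())


def rank_variants(variants: List[dict], weights: Dict[str, float]) -> List[dict]:
    """Return the variants sorted from highest to lowest priority score."""
    return sorted(
        variants,
        key=lambda v: priority_score(v["annotations"], weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical annotation names and weights chosen by the user.
    weights = {"deleteriousness": 1.0, "rarity": 2.0}
    variants = [
        {"id": "v1", "annotations": {"deleteriousness": 0.75}},
        {"id": "v2", "annotations": {"deleteriousness": 0.5, "rarity": 1.0}},
    ]
    for variant in rank_variants(variants, weights):
        print(variant["id"], priority_score(variant["annotations"], weights))
```

Keeping the weights in a separate mapping mirrors the abstract's point that the ranking is customizable: changing a weight changes the ordering without touching the annotated variants.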

Page 540-545


Application Note

MakeHub: Fully Automated Generation of UCSC Genome Browser Assembly Hubs

Katharina Jasmin Hoff

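Novel genomes are today often annotated by small consortia or individuals without a bioinformatics background. This audience requires tools that are easy to use. Such need has been addressed by several genome annotation tools and pipelines. Visualizing the resulting annotation is a crucial step of quality control. The UCSC Genome Browser is a powerful and popular genome visualization tool. Assembly Hubs, which can be hosted on any publicly available web server, allow browsing genomes via UCSC Genome Browser servers. The steps for creating custom Assembly Hubs are well documented and the required tools are publicly available. However, the number of steps for creating a novel Assembly Hub is large. In some cases, the format of input files needs to be adapted, which is a difficult task for scientists without a programming background. Here, we describe MakeHub, a novel command line tool that generates Assembly Hubs for the UCSC Genome Browser in a fully automated fashion. The pipeline also allows extending previously created Hubs with additional tracks. MakeHub is freely available for downloading at https://github.com/Gaius-Augustus/MakeHub.
As sequencing costs fall, more and more individuals and small research groups can afford to sequence the genomes of non-model organisms that interest them. Tools for annotating protein-coding genes in novel genomes that can be used by scientists from varied backgrounds have also been developed, for example AUGUSTUS, GeneMark ES/ET, GlimmerHMM, SNAP, and GeMoMa, as well as BRAKER, WebAUGUSTUS, and MAKER. The output of such gene prediction tools is a table-like text file in Gene Transfer Format (GTF) or General Feature Format 3 (GFF3). In every genome annotation project, visualizing the predicted gene structures is a key quality-control step. Many genome browsers can visualize genomes, for example the UCSC Genome Browser, JBrowse, and GBrowse2. Of these, the UCSC Genome Browser is a powerful genome visualization tool widely used by researchers. Assembly Hubs, which can be hosted on any publicly available web server, allow genomes to be browsed via UCSC Genome Browser servers. The steps for creating custom Assembly Hubs are publicly documented, and the required tools are also publicly available. However, creating a new Assembly Hub involves many steps. In some cases the format of input files must be adapted, which is a difficult task for scientists without a programming background. We therefore developed MakeHub, a new command line tool that fully automatically generates UCSC Assembly Hubs from the single-species genome annotations output by BRAKER, MAKER, GlimmerHMM, SNAP, and GeMoMa. The method also allows previously created Hubs to be extended with additional tracks. MakeHub is freely available at https://github.com/Gaius-Augustus/MakeHub.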

Page 546-549


Method

Gclust: A Parallel Clustering Tool for Microbial Genomic Data

Ruilin Li, Xiaoyu He, Chuangchuang Dai, Haidong Zhu, Xianyu Lang, Wei Chen, Xiaodong Li, Dan Zhao, Yu Zhang, Xinyin Han, Tie Niu, Yi Zhao, Rongqiang Cao, Rong He, Zhonghua Lu, Xuebin Chi, Weizhong Li, Beifang Niu

The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
Public microbial genomic data worldwide continue to grow at an accelerating rate, placing a heavy burden on researchers who use these resources. Building databases of non-redundant reference sequences from large-scale microbial genomic data by clustering analysis is therefore essential. However, current clustering algorithms perform poorly on long genomic sequences. To address this problem, we developed Gclust, software for parallel clustering of large-scale microbial whole genomes. First, the software uses a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs) to accelerate clustering. Second, it uses the maximal exact matches (MEMs) between sequences to calculate the identity between two sequences. Finally, tests on four standard genome sequence datasets show that Gclust achieves the highest clustering performance and clustering quality. Gclust is open-source software, freely downloadable for non-commercial use at https://github.com/niu-lab/gclust. We also developed an online clustering platform where users can upload genome data for testing, at http://niulab.scgrid.cn/gclust.
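
To make the notion of an identity measure based on maximal exact matches (MEMs) concrete, the following is a brute-force sketch. It is not Gclust's algorithm: Gclust finds matches with sparse suffix arrays, whereas this example uses quadratic dynamic programming that is only practical for short strings. The identity definition used here (fraction of the first sequence covered by MEMs of at least `min_len` bases, ignoring match order and strand) is an assumed simplification, and both function names are hypothetical.

```python
from typing import List, Tuple


def maximal_exact_matches(a: str, b: str, min_len: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start_in_a, start_in_b, length) for every MEM of at least min_len."""
    mems = []
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row of a
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the exact match that ended at the previous position pair.
                curr[j] = prev[j - 1] + 1
                # The run cannot extend rightwards if a string ends or the next bases differ.
                at_end = i == len(a) or j == len(b)
                if (at_end or a[i] != b[j]) and curr[j] >= min_len:
                    mems.append((i - curr[j], j - curr[j], curr[j]))
        prev = curr
    return mems


def mem_identity(a: str, b: str, min_len: int = 3) -> float:
    """Fraction of positions in a covered by at least one MEM shared with b."""
    if not a:
        return 0.0
    covered = [False] * len(a)
    for start_a, _start_b, length in maximal_exact_matches(a, b, min_len):
        for k in range(start_a, start_a + length):
            covered[k] = True
    return sum(covered) / len(a)
```

A clustering tool could compare such a value against an identity threshold to decide whether two genomes belong in the same cluster. Because this simplified coverage ignores the order of matches, repeated segments can inflate it, so it should be read as an illustration of the concept and not as a usable genome identity.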

Page 496-502