Article Online - Genomics, Proteomics & Bioinformatics

Volume: 16, Issue: 4

Preface

Bioinformatics Commons: The Cornerstone of Life and Health Sciences

Zhang Zhang, Yu Xue, Fangqing Zhao

Page 223-225

Database

CIRCpedia v2: An Updated Database for Comprehensive Circular RNA Annotation and Expression Comparison

Rui Dong, Xu-Kai Ma, Guo-Wei Li, Li Yang

Circular RNAs (circRNAs) from back-splicing of exon(s) have recently been found to be broadly expressed in eukaryotes, in tissue- and species-specific manners. Although the functions of most circRNAs remain elusive, some circRNAs have been shown to function in gene expression regulation and are potentially related to diseases. Due to their stability, circRNAs can also be used as biomarkers for diagnosis. Profiling circRNAs by integrating their expression among different samples thus provides a molecular basis for further functional study of circRNAs and their potential application in the clinic. Here, we report CIRCpedia v2, an updated database for comprehensive circRNA annotation from over 180 RNA-seq datasets across six different species. This atlas allows users to search, browse, and download circRNAs with expression features in various cell types/tissues, including disease samples. In addition, the updated database incorporates conservation analysis of circRNAs between humans and mice. Finally, the web interface also contains computational tools to compare circRNA expression among samples. CIRCpedia v2 is accessible at http://www.picb.ac.cn/rnomics/circpedia.

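Circular RNAs produced by back-splicing of exons are a new class of RNA molecules that lack a 5' cap and a 3' poly(A) tail and instead form a covalently closed loop. They are widely expressed in eukaryotes and show pronounced tissue- and species-specific expression patterns. Profiling circular RNA expression across tissues and species therefore lays the groundwork for in-depth study of their biogenesis, processing mechanisms, and potential functions. Li Yang's group at the CAS-MPG Partner Institute for Computational Biology recently released CIRCpedia v2 (http://www.picb.ac.cn/rnomics/circpedia), an upgraded circular RNA database website containing circular RNA analysis data from more than 180 samples across six species. Through its search, browse, and download modules, users can obtain diverse information such as the genomic coordinates, expression levels, alternative back-splicing, and human-mouse conservation of circular RNAs, and can use new online analysis tools to compare circular RNAs among samples. The upgraded database provides a comprehensive, integrated platform for circular RNA research, offering data support and a theoretical basis for further functional studies.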

Page 226-233

Database

HeteroMeth: A Database of Cell-to-cell Heterogeneity in DNA Methylation

Qing Huan, Yuliang Zhang, Shaohuan Wu, Wenfeng Qian

DNA methylation is an important epigenetic mark that plays a vital role in gene expression and cell differentiation. The average DNA methylation level among a group of cells has been extensively documented. However, the cell-to-cell heterogeneity in DNA methylation, which reflects the differentiation of epigenetic status among cells, remains less investigated. Here we established a gold standard of the cell-to-cell heterogeneity in DNA methylation based on single-cell bisulfite sequencing (BS-seq) data. With that, we optimized a computational pipeline for estimating the heterogeneity in DNA methylation from bulk BS-seq data. We further built HeteroMeth, a database for searching, browsing, visualizing, and downloading the data for heterogeneity in DNA methylation for a total of 141 samples in humans, mice, Arabidopsis, and rice. Three genes are used as examples to illustrate the power of HeteroMeth in the identification of unique features in DNA methylation. The optimization of the computational strategy and the construction of the database in this study complement the recent experimental attempts on single-cell DNA methylomes and will facilitate the understanding of epigenetic mechanisms underlying cell differentiation and embryonic development. HeteroMeth is publicly available at http://qianlab.genetics.ac.cn/HeteroMeth.

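As an important epigenetic regulator, DNA methylation plays a vital role in the regulation of gene expression and in cell differentiation. Researchers usually treat a cell population as a whole and base their analyses on its average DNA methylation level. Notably, DNA methylation is not uniform across cells, and this heterogeneity may reflect the divergence of epigenetic states among cells. So far, however, studies of DNA methylation heterogeneity at the single-cell level have rarely been reported. Based on single-cell bisulfite sequencing data, we established a gold standard for calculating cell-to-cell DNA methylation heterogeneity, then optimized a strategy for estimating that heterogeneity from bisulfite sequencing data of bulk cell samples, and built the HeteroMeth database. The database provides DNA methylation heterogeneity data for a total of 141 samples from humans, mice, Arabidopsis, and rice, which can be conveniently searched, browsed, visualized, and downloaded. The optimized computational strategy and the database will promote the systematic identification of features of DNA methylation heterogeneity and thereby provide key support for exploring the epigenetic regulatory mechanisms of cell differentiation and embryonic development. HeteroMeth is publicly available at http://qianlab.genetics.ac.cn/HeteroMeth.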
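
The distinction between average methylation and cell-to-cell heterogeneity can be illustrated with a minimal Python sketch. It is not the HeteroMeth pipeline: the variance-based measure, the function names, and the toy data are all assumptions chosen for illustration. Each cell is a list of binary methylation calls for the CpG sites of one region.

```python
from statistics import mean, pvariance


def region_levels(cells):
    """Per-cell methylation level: fraction of methylated CpGs (1 = methylated)."""
    return [sum(calls) / len(calls) for calls in cells]


def summarize(cells):
    """Return (average methylation, cell-to-cell variance) for one region.

    Variance across cells is an illustrative heterogeneity measure only,
    not the gold standard defined by the HeteroMeth authors.
    """
    levels = region_levels(cells)
    return mean(levels), pvariance(levels)


# Hypothetical toy data: 4 cells x 4 CpG sites in one region.
uniform_cells = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
mixed_cells = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

print(summarize(uniform_cells))  # (0.5, 0.0)
print(summarize(mixed_cells))    # (0.5, 0.25)
```

Both toy populations have the same average methylation level (0.5), yet only the second shows variation between cells, which is the kind of signal a population average alone cannot reveal.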

Page 234-243

Database

PTMD: A Database of Human Disease-associated Post-translational Modifications

Haodong Xu, Yongbo Wang, Shaofeng Lin, Wankun Deng, Di Peng, Qinghua Cui, Yu Xue

Various posttranslational modifications (PTMs) participate in nearly all aspects of biological processes by regulating protein functions, and aberrant states of PTMs are frequently implicated in human diseases. Therefore, an integral resource of PTM–disease associations (PDAs) would be a great help for both academic research and clinical use. In this work, we reported PTMD, a well-curated database containing PTMs that are associated with human diseases. We manually collected 1950 known PDAs in 749 proteins for 23 types of PTMs and 275 types of diseases from the literature. Database analyses show that phosphorylation has the largest number of disease associations, whereas neurologic diseases have the largest number of PTM associations. We classified all known PDAs into six classes according to the PTM status in diseases and demonstrated that the upregulation and presence of PTM events account for a predominant proportion of disease-associated PTM events. By reconstructing a disease–gene network, we observed that breast cancers have the largest number of associated PTMs and AKT1 has the largest number of PTMs connected to diseases. Finally, the PTMD database was developed with detailed annotations and can be a useful resource for further analyzing the relations between PTMs and human diseases. PTMD is freely accessible at http://ptmd.biocuckoo.org.

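By regulating protein function, post-translational modifications (PTMs) take part in almost all biological processes, and aberrant PTM states are often closely linked to human diseases. Integrating existing information on disease-associated PTMs would therefore be of great help to both academic research and clinical applications. In this work, we released PTMD, a carefully annotated database of PTMs associated with human diseases. We manually collected 1950 disease-associated PTM records from the literature. They are located on 749 proteins and cover 23 PTM types and 275 disease types. Among them, phosphorylation has the most disease associations, while neurological diseases cover the most PTM types. We divided all known disease-associated PTMs into six classes according to the effect of the PTM on the disease; the results indicate that upregulation of the PTM level and the presence of the PTM are more closely associated with disease. By constructing a disease-gene network, we found that breast cancer has the largest number of PTM associations, while the AKT1 gene has the largest number of disease-associated PTMs. Finally, the PTMD database carries very detailed annotations and can be a useful resource for further analysis of the relationship between PTMs and human diseases. PTMD can be accessed at http://ptmd.biocuckoo.org.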

Page 244-251

Database

GAAD: A Gene and Autoimmiune Disease Association Database

Guanting Lu, Xiaowen Hao, Wei-Hua Chen, Shijie Mu

Autoimmune diseases (ADs) arise from an abnormal immune response of the body against substances and tissues normally present in the body. More than a hundred ADs have been described in the literature so far. Although their etiology remains largely unclear, various types of ADs tend to share more associated genes with other types of ADs than with non-AD types. Here we present GAAD, a gene and AD association database. In GAAD, we collected 44,762 associations between 49 ADs and 4249 genes from public databases and MEDLINE documents. We manually verified the associations to ensure their quality and credibility. We reconstructed and recapitulated the relationships among ADs using their shared genes, which further validated the quality of our data. We also provided a list of significantly co-occurring gene pairs among ADs; with embedded tools, users can query gene co-occurrences and construct customized co-occurrence networks with genes of interest. To make GAAD more straightforward for experimental biologists and medical scientists, we extracted additional information describing the associations through text mining, including the putative diagnostic value of the associations, the type and position of gene polymorphisms, expression changes of implicated genes, as well as the phenotypical consequences, and grouped the associations accordingly. GAAD is freely available at http://gaad.medgenius.info.

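Autoimmune diseases are diseases in which the body mounts an immune response against its own antigens, resulting in damage to its own tissues. More than a hundred autoimmune diseases have been described in the literature so far. Although their causes remain unclear, we found that autoimmune diseases share more associated genes with one another than with non-autoimmune diseases. On this basis, we developed GAAD (A Gene and Autoimmiune Disease Association Database). In GAAD, we collected 44,762 associations between 49 autoimmune diseases and 4249 genes from public databases and MEDLINE documents. We verified these associations manually to ensure their quality and accuracy. In addition, we used the genes shared among these autoimmune diseases to reconstruct and recapitulate the relationships among the diseases, which further confirmed the reliability of our data. The database also provides a list of significantly co-occurring gene pairs among autoimmune diseases; with these data, users can use the embedded tools to query gene co-occurrences and build customized co-occurrence networks from genes of interest. To make the database more convenient for experimental biologists and medical researchers, we used text mining to extract additional information describing each association, including its putative diagnostic value, the type and position of gene polymorphisms, and the expression and phenotypic changes of the associated genes, and grouped the associations accordingly. GAAD (http://gaad.medgenius.info) is free to use, and its data are free to download.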
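
The two ideas in this abstract, relating diseases through shared genes and counting co-occurring gene pairs, can be sketched in Python as follows. This is an illustration under stated assumptions rather than GAAD's actual method: the Jaccard index, the function names, and the toy disease and gene labels are hypothetical, and the significance testing mentioned in the abstract is not shown.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy associations (disease -> associated genes); not GAAD data.
associations = {
    "disease_A": {"G1", "G2", "G3"},
    "disease_B": {"G2", "G3", "G4"},
    "disease_C": {"G3", "G5"},
}


def jaccard(a, b):
    """Shared genes divided by all genes associated with either disease."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_pair_cooccurrence(assoc):
    """For every gene pair, count the diseases associated with both genes."""
    counts = Counter()
    for genes in assoc.values():
        counts.update(combinations(sorted(genes), 2))
    return counts


# Pairwise disease similarity based on shared genes.
similarity = {
    (d1, d2): jaccard(associations[d1], associations[d2])
    for d1, d2 in combinations(sorted(associations), 2)
}
pair_counts = gene_pair_cooccurrence(associations)

print(similarity[("disease_A", "disease_B")])  # 0.5
print(pair_counts[("G2", "G3")])               # 2
```

A similarity matrix of this kind could feed a clustering step to group related diseases, and the pair counts could serve as edge weights in a co-occurrence network.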

Page 252-261

Database

CCGD-ESCC: A Comprehensive Database for Genetic Variants Associated with Esophageal Squamous Cell Carcinoma in Chinese Population

Linna Peng, Sijin Cheng, Yuan Lin, Qionghua Cui, Yingying Luo, Jiahui Chu, Mingming Shao, Wenyi Fan, Yamei Chen, Ai Lin, Yiyi Xi, Yanxia Sun, Lei Zhang, Chao Zhang, Wen Tan, Ge Gao, Chen Wu, Dongxin Lin

Esophageal squamous-cell carcinoma (ESCC) is one of the most lethal malignancies in the world and occurs at a particularly high frequency in China. While several genome-wide association studies (GWAS) of germline variants and whole-genome or whole-exome sequencing studies of somatic mutations in ESCC have been published, there is no comprehensive database publicly available for this cancer. Here, we developed the Chinese Cancer Genomic Database-Esophageal Squamous Cell Carcinoma (CCGD-ESCC) database, which contains the associations of 69,593 single nucleotide polymorphisms (SNPs) with ESCC risk in 2022 cases and 2039 controls, survival time of 1006 ESCC patients (survival GWAS) and gene expression (expression quantitative trait loci, eQTL) in 94 ESCC patients. Moreover, this database also provides the associations between 8833 somatic mutations and survival time in 675 ESCC patients. Our user-friendly database is a resource useful for biologists and oncologists not only in identifying the associations of genetic variants or somatic mutations with the development and progression of ESCC but also in studying the underlying mechanisms for tumorigenesis of the cancer. CCGD-ESCC is freely accessible at http://db.cbi.pku.edu.cn/ccgd/ESCCdb.

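Esophageal cancer is a characteristic tumor of the Chinese population, yet genomic data for it remain scarce compared with other tumors. At present there is still no database, internationally, that comprehensively and systematically presents esophageal cancer association studies and supports queries on them. We therefore integrated and analyzed several types of esophageal cancer association data: (1) a genome-wide association study of esophageal cancer susceptibility in 2022 cases and 2039 normal controls; (2) a genome-wide association study of survival in 1006 esophageal cancer patients; (3) an association study of genetic variants and gene expression in tumor tissue and paired adjacent normal tissue from 94 esophageal cancer patients; and (4) an association study of somatic variants and survival in 675 esophageal cancer patients. From these we built CCGD-ESCC, the first association-gene database for esophageal cancer, which shares data resources freely to the greatest possible extent and supports research on the genetics and genomics of esophageal cancer.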

Page 262-268

Database

HCCDB: A Database of Hepatocellular Carcinoma Expression Atlas

Qiuyu Lian, Shicheng Wang, Guchao Zhang, Dongfang Wang, Guijuan Luo, Jing Tang, Lei Chen, Jin Gu

Hepatocellular carcinoma (HCC) is highly heterogeneous in nature and has been one of the most common cancer types worldwide. To ensure repeatability of identified gene expression patterns and comprehensively annotate the transcriptomes of HCC, we carefully curated 15 public HCC expression datasets that cover around 4000 clinical samples and developed the database HCCDB to serve as a one-stop online resource for exploring HCC gene expression with user-friendly interfaces. The global differential gene expression landscape of HCC was established by analyzing the consistently differentially expressed genes across multiple datasets. Moreover, a 4D metric was proposed to fully characterize the expression pattern of each gene by integrating data from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx). To facilitate a comprehensive understanding of gene expression patterns in HCC, HCCDB also provides links to third-party databases on drugs, proteomics, and literature, and graphically displays the results from computational analyses, including differential expression analysis, tissue-specific and tumor-specific expression analysis, survival analysis, and co-expression analysis. HCCDB is freely accessible at http://lifeome.net/database/hccdb.

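Hepatocellular carcinoma (HCC) is a common cancer with a very high mortality rate. Advances in high-throughput biotechnology allow HCC to be characterized at different molecular levels. Expression profile data have now accumulated in large quantities and are widely used in studies of molecular subtyping and tumor biomarkers. However, for a disease as heterogeneous as HCC, inferences drawn from a single expression dataset with a limited sample size are prone to poor reproducibility and high false-positive rates. To find highly reproducible aberrant expression patterns and comprehensively annotate gene expression in HCC, we collected 15 public expression datasets totaling nearly 4000 clinical samples and developed the database HCCDB, which offers a user-friendly, one-stop online search service. By analyzing multiple datasets we obtained a landscape of consistently differentially expressed genes, and by combining data from TCGA and GTEx we defined a 4D metric to fully describe the expression pattern of each gene. In addition, HCCDB provides third-party links on drugs, proteins, and literature mining, and graphically presents computational results such as differential expression analysis, tissue specificity, survival analysis, and co-expression. HCCDB is available at http://lifeome.net/database/hccdb.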

Page 269-275

Database

TSNAdb: A Database for Tumor-specific Neoantigens from Immunogenomics Data Analysis

Jingcheng Wu, Wenyi Zhao, Binbin Zhou, Zhixi Su, Xun Gu, Zhan Zhou, Shuqing Chen

Tumor-specific neoantigens have attracted much attention since they can be used as biomarkers to predict therapeutic effects of immune checkpoint blockade therapy and as potential targets for cancer immunotherapy. In this study, we developed a comprehensive tumor-specific neoantigen database (TSNAdb v1.0), based on pan-cancer immunogenomic analyses of somatic mutation data and human leukocyte antigen (HLA) allele information for 16 tumor types with 7748 tumor samples from The Cancer Genome Atlas (TCGA) and The Cancer Immunome Atlas (TCIA). We predicted binding affinities between mutant/wild-type peptides and HLA class I molecules by NetMHCpan v2.8/v4.0, and presented detailed information of 3,707,562/1,146,961 potential neoantigens generated by somatic mutations of all tumor samples. Moreover, we employed recurrent mutations in combination with highly frequent HLA alleles to predict potential shared neoantigens across tumor patients, which would facilitate the discovery of putative targets for neoantigen-based cancer immunotherapy. TSNAdb is freely available at http://biopharm.zju.edu.cn/tsnadb.

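With the rapid development of cancer genomics and immunotherapy, tumor-specific neoantigens have become increasingly important. They can serve both as indicators for predicting the efficacy of checkpoint blockade therapy and as potential targets for tumor immune cell therapy. Based on tumor immunogenomic analyses, this study developed TSNAdb v1.0, a database for the systematic analysis of tumor-specific neoantigens. We collected somatic mutation information for 7748 tumor samples across 16 tumor types from TCGA, obtained the HLA typing of the corresponding tumor samples from TCIA, and predicted neoantigens with NetMHCpan, a tool that predicts the binding affinity between HLA types and peptides. Using two versions of NetMHCpan (v2.8 and v4.0) to predict the tumor-specific neoantigens generated by somatic mutations, we obtained 3,707,562 and 1,146,961 potential tumor neoantigens, respectively. In addition, we extracted high-frequency somatic mutations and high-frequency HLA types from the tumor samples and used them to predict potential neoantigens that are widely shared among tumor patients, providing potential targets for neoantigen-based immunotherapy. We believe that continued progress in tumor immunogenomics will keep advancing the discovery and identification of tumor-specific neoantigens and the development of neoantigen-targeted tumor immunotherapies. TSNAdb is freely available at http://biopharm.zju.edu.cn/tsnadb/.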
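
Comparing mutant and wild-type peptides requires first enumerating the peptide windows that cover a mutated residue. The Python sketch below shows one way this could be done; it is an assumption-laden illustration, not the TSNAdb pipeline. The function name, the default peptide length of 9 residues, and the toy protein sequence are hypothetical, and the binding-affinity prediction itself (performed with NetMHCpan in the study) is not invoked here.

```python
def peptide_pairs(protein, pos, mutant_aa, k=9):
    """Return (wild_type, mutant) k-mer pairs covering 0-based position `pos`.

    `k=9` is an illustrative default; the peptide lengths used by the
    TSNAdb authors are not stated in the abstract.
    """
    mutated = protein[:pos] + mutant_aa + protein[pos + 1:]
    first = max(0, pos - k + 1)          # leftmost window that still covers pos
    last = min(pos, len(protein) - k)    # rightmost window that fits in the protein
    return [(protein[s:s + k], mutated[s:s + k]) for s in range(first, last + 1)]


# Hypothetical 20-residue protein with a missense change M -> A at index 10.
protein = "ACDEFGHIKLMNPQRSTVWY"
pairs = peptide_pairs(protein, 10, "A")

print(len(pairs))  # 9
print(pairs[0])    # ('DEFGHIKLM', 'DEFGHIKLA')
```

Each pair could then be scored against a patient's HLA alleles by an external predictor, keeping the wild-type peptide as the reference for the mutant one.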

Page 276-282

Database

PlaD: A Transcriptomics Database for Plant Defense Responses to Pathogens, Providing New Insights into Plant Immune System

Huan Qi, Zhenhong Jiang, Kang Zhang, Shiping Yang, Fei He, Ziding Zhang

High-throughput transcriptomics technologies have been widely used to study plant transcriptional reprogramming during the process of plant defense responses, and a large quantity of gene expression data has accumulated in public repositories. However, utilization of these data is often hampered by the lack of standard metadata annotation. In this study, we curated 2444 public pathogenesis-related gene expression samples from the model plant Arabidopsis and three major crops (maize, rice, and wheat). We organized the data into a user-friendly database termed PlaD. Currently, PlaD contains three key features. First, it provides large-scale curated data related to plant defense responses, including gene expression and gene functional annotation data. Second, it provides the visualization of condition-specific expression profiles. Third, it allows users to search for genes co-regulated under infection by various pathogens. Using PlaD, we conducted a large-scale transcriptome analysis to explore the global landscape of gene expression in the curated data. We found that only a small fraction of genes were differentially expressed under multiple conditions, which might be explained by their tendency to have more network connections and shorter network distances in gene networks. Collectively, we hope that PlaD can serve as an important and comprehensive knowledgebase for the plant science community, providing insightful clues to better understand the molecular mechanisms underlying plant immune responses. PlaD is freely available at http://systbio.cau.edu.cn/plad/index.php or http://zzdlab.com/plad/index.php.

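High-throughput transcriptomics technologies have been widely used to study transcriptional reprogramming in plant immunity, and large amounts of transcriptome data have accumulated in public databases. However, because these data lack standardized annotation, the great value they hold has not yet been used effectively. In this study, we curated 2444 pathogen-related gene expression samples from the model plant Arabidopsis and three important crops (maize, rice, and wheat). By organizing and analyzing these data, we built a user-friendly transcriptome database, PlaD. PlaD currently has three key features. First, it provides large-scale data related to plant defense responses, mainly gene expression data and functional annotation. Second, it visualizes condition-specific expression profiles. Third, it allows users to search for genes co-regulated under multiple conditions. Using the data stored in PlaD, we also carried out a large-scale transcriptome analysis to explore the global characteristics of changes in plant gene expression. We found that only a small fraction of genes were differentially expressed under multiple conditions, and that these genes tend to have more network connections and shorter network distances in gene functional networks. In summary, we hope that PlaD can serve as a comprehensive knowledgebase that offers plant scientists useful clues for further study of plant immune response mechanisms. PlaD is currently available at http://systbio.cau.edu.cn/plad/index.php or http://zzdlab.com/plad/index.php.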

Page 283-293

Web Server

DeepNitro: Prediction of Protein Nitration and Nitrosylation Sites by Deep Learning

Yubin Xie, Xiaotong Luo, Yupeng Li, Li Chen, Wenbin Ma, Junjiu Huang, Jun Cui, Yong Zhao, Yu Xue, Zhixiang Zuo, Jian Ren

Protein nitration and nitrosylation are essential post-translational modifications (PTMs) involved in many fundamental cellular processes. Recent studies have revealed that excessive levels of nitration and nitrosylation in some critical proteins are linked to numerous chronic diseases. Therefore, the identification of substrates that undergo such modifications in a site-specific manner is an important research topic in the community and will provide candidates for targeted therapy. In this study, we aimed to develop a computational tool for predicting nitration and nitrosylation sites in proteins. We first constructed four types of encoding features, including positional amino acid distributions, sequence contextual dependencies, physicochemical properties, and position-specific scoring features, to represent the modified residues. Based on these encoding features, we established a predictor called DeepNitro using deep learning methods for predicting protein nitration and nitrosylation. Using n-fold cross-validation, our evaluation yields AUC values for DeepNitro of 0.65 for tyrosine nitration, 0.80 for tryptophan nitration, and 0.70 for cysteine nitrosylation, demonstrating the robustness and reliability of our tool. Also, when tested on the independent dataset, DeepNitro is substantially superior to other similar tools, with a 7%−42% improvement in prediction performance. Taken together, the application of deep learning methods and novel encoding schemes, especially the position-specific scoring feature, greatly improves the accuracy of nitration and nitrosylation site prediction and may facilitate the prediction of other PTM sites. DeepNitro is implemented in JAVA and PHP and is freely available for academic research at http://deepnitro.renlab.org.

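Protein nitration and nitrosylation are key types of protein post-translational modification that play important roles in many common cellular regulatory processes. Recent studies have shown that aberrant levels of nitration and nitrosylation on certain critical proteins are associated with a variety of chronic diseases. Identifying the precise modification sites on substrates is therefore an important focus of current research and can provide potential targets for targeted therapy of chronic diseases. In this study, we developed DeepNitro, an accurate site prediction tool for protein nitration and nitrosylation. First, we introduced four encoding algorithms into the computational model (amino acid distribution, upstream and downstream sequence context, physicochemical properties, and position-specific scoring) to extract training features for the modification sites. Based on these feature encodings, we used deep learning to build a site prediction model dedicated to protein nitration and nitrosylation. N-fold cross-validation shows that the model gives stable and reliable predictions, with AUC values of 0.65, 0.80, and 0.70 for tyrosine nitration, tryptophan nitration, and cysteine nitrosylation, respectively. Evaluation on an independent test set also shows that DeepNitro is significantly more accurate than existing tools, with a 7%-42% improvement in prediction performance. In summary, by applying deep learning and novel feature encoding methods, we improved the prediction accuracy for protein nitration and nitrosylation and further assisted the high-throughput identification of these modification sites. DeepNitro is developed in JAVA and PHP and is freely available at http://deepnitro.renlab.org.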
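
For readers unfamiliar with the AUC values quoted in this abstract, the metric can be computed directly from labels and prediction scores. The Python sketch below uses the pairwise-ranking definition of the area under the ROC curve; the function name and the toy labels and scores are hypothetical, and the code makes no claim about how DeepNitro itself computes its evaluation.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive site receives
    a higher score than a randomly chosen negative site (ties count 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical held-out fold: 1 = modified site, 0 = unmodified site.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]

print(auc(labels, scores))  # 0.75
```

In a cross-validation setting, a function like this would be applied to the held-out predictions of each fold and the per-fold values averaged; an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.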

Page 294-306
