Volume: 18, Issue: 2

Perspective

Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Qingyu Chen, Ramona Britto, Ivan Erill, Constance J. Jeffery, Arthur Liberzon, Michele Magrane, Jun-ichi Onami, Marc Robinson-Rechavi, Jana Sponarova, Justin Zobel, Karin Verspoor

no abstract

Page 91-103


Original Research

DPHL: A DIA Pan-human Protein Mass Spectrometry Library for Robust Biomarker Discovery

Tiansheng Zhu, Yi Zhu, Yue Xuan, Huanhuan Gao, Xue Cai, Sander R. Piersma, Thang V. Pham, Tim Schelfhorst, Richard R.G.D. Haas, Irene V. Bijnsdorp, Rui Sun, Liang Yue, Guan Ruan, Qiushi Zhang, Mo Hu, Yue Zhou, Winan J. Van Houdt, Tessa Y.S. Le Large, Jacqueline Cloos, Anna Wojtuszkiewicz, Danijela Koppers-Lalic, Franziska Böttger, Chantal Scheepbouwer, Ruud H. Brakenhoff, Geert J.L.H. van Leenders, Jan N.M. Ijzermans, John W.M. Martens, Renske D.M. Steenbergen, Nicole C. Grieken, Sathiyamoorthy Selvarajan, Sangeeta Mantoo, Sze S. Lee, Serene J.Y. Yeow, Syed M.F. Alkaff, Nan Xiang, Yaoting Sun, Xiao Yi, Shaozheng Dai, Wei Liu, Tian Lu, Zhicheng Wu, Xiao Liang, Man Wang, Yingkuan Shao, Xi Zheng, Kailun Xu, Qin Yang, Yifan Meng, Cong Lu, Jiang Zhu, Jin'e Zheng, Bo Wang, Sai Lou, Yibei Dai, Chao Xu, Chenhuan Yu, Huazhong Ying, Tony K. Lim, Jianmin Wu, Xiaofei Gao, Zhongzhi Luan, Xiaodong Teng, Peng Wu, Shi'ang Huang, Zhihua Tao, Narayanan G. Iyer, Shuigeng Zhou, Wenguang Shao, Henry Lam, Ding Ma, Jiafu Ji, Oi L. Kon, Shu Zheng, Ruedi Aebersold, Connie R. Jimenez, Tiannan Guo

To address the increasing need for detecting and validating protein biomarkers in clinical specimens, mass spectrometry (MS)-based targeted proteomic techniques, including the selected reaction monitoring (SRM), parallel reaction monitoring (PRM), and massively parallel data-independent acquisition (DIA), have been developed. For optimal performance, they require the fragment ion spectra of targeted peptides as prior knowledge. In this report, we describe an MS pipeline and spectral resource to support targeted proteomics studies for human tissue samples. To build the spectral resource, we integrated common open-source MS computational tools to assemble a freely accessible computational workflow based on Docker. We then applied the workflow to generate DPHL, a comprehensive DIA pan-human library, from 1096 data-dependent acquisition (DDA) MS raw files for 16 types of cancer samples. This extensive spectral resource was then applied to a proteomic study of 17 prostate cancer (PCa) patients. Thereafter, PRM validation was applied to a larger study of 57 PCa patients and the differential expression of three proteins in prostate tumor was validated. As a second application, the DPHL spectral resource was applied to a study consisting of plasma samples from 19 diffuse large B cell lymphoma (DLBCL) patients and 18 healthy control subjects. Differentially expressed proteins between DLBCL patients and healthy control subjects were detected by DIA-MS and confirmed by PRM. These data demonstrate that the DPHL supports DIA and PRM MS pipelines for robust protein biomarker discovery. DPHL is freely accessible at https://www.iprox.org/page/project.html?id=IPX0001400000.

Page 104-119


Database

hTFtarget: A Comprehensive Database for Regulations of Human Transcription Factors and Their Targets

Qiong Zhang, Wei Liu, Hong-Mei Zhang, Gui-Yan Xie, Ya-Ru Miao, Mengxuan Xia, An-Yuan Guo

Transcription factors (TFs) as key regulators play crucial roles in biological processes. The identification of TF–target regulatory relationships is a key step for revealing functions of TFs and their regulations on gene expression. The accumulated data of chromatin immunoprecipitation sequencing (ChIP-seq) provide great opportunities to discover the TF–target regulations across different conditions. In this study, we constructed a database named hTFtarget, which integrated huge human TF target resources (7190 ChIP-seq samples of 659 TFs and high-confidence binding sites of 699 TFs) and epigenetic modification information to predict accurate TF–target regulations. hTFtarget offers the following functions for users to explore TF–target regulations: (1) browse or search general targets of a query TF across datasets; (2) browse TF–target regulations for a query TF in a specific dataset or tissue; (3) search potential TFs for a given target gene or non-coding RNA; (4) investigate co-association between TFs in cell lines; (5) explore potential co-regulations for given target genes or TFs; (6) predict candidate TF binding sites on given DNA sequences; (7) visualize ChIP-seq peaks for different TFs and conditions in a genome browser. hTFtarget provides a comprehensive, reliable and user-friendly resource for exploring human TF–target regulations, which will be very useful for a wide range of users in the TF and gene expression regulation community. hTFtarget is available at http://bioinfo.life.hust.edu.cn/hTFtarget.

Page 120-128


Database

IRESbase: A Comprehensive Database of Experimentally Validated Internal Ribosome Entry Sites

Jian Zhao, Yan Li, Cong Wang, Haotian Zhang, Hao Zhang, Bin Jiang, Xuejiang Guo, Xiaofeng Song

Internal ribosome entry sites (IRESs) are functional RNA elements that can directly recruit ribosomes to an internal position of the mRNA in a cap-independent manner to initiate translation. Recently, IRES elements have attracted much attention for their critical roles in various processes including translation initiation of a new type of RNA, circular RNA (circRNA), with no 5′ cap to support classical cap-dependent translation. Thus, an integrative data resource of IRES elements with experimental evidence will be useful for further studies. In this study, we present IRESbase, a comprehensive database of IRESs, by curating the experimentally validated functional minimal IRES elements from literature and annotating their host linear and circular RNAs. The current version of IRESbase contains 1328 IRESs, including 774 eukaryotic IRESs and 554 viral IRESs from 11 eukaryotic organisms and 198 viruses, respectively. As IRESbase collects only IRES of minimal length with functional evidence, the median length of IRESs in IRESbase is 174 nucleotides. By mapping IRESs to human circRNAs and long non-coding RNAs (lncRNAs), 2191 circRNAs and 168 lncRNAs were found to contain at least one entire or partial IRES sequence. IRESbase is available at http://reprod.njmu.edu.cn/cgi-bin/iresbase/index.php.
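Internal ribosome entry sites (IRESs) are internal functional RNA elements that recruit ribosomes in a cap-independent manner to initiate translation. Recently, IRES elements have attracted wide attention for their key role in the translation initiation of circular RNAs (circRNAs). Because circRNAs are covalently closed, they naturally lack the 5′ cap structure required by the cap-dependent translation initiation mechanism, so IRES elements are essential to their coding function. To facilitate in-depth research on circRNAs, we manually collected and curated experimentally validated IRES elements from the literature and built a comprehensive database, IRESbase, with detailed annotation of the linear and circular RNAs that contain IRES elements. IRESbase currently holds 1328 IRES elements (774 eukaryotic and 554 viral) from 11 eukaryotic organisms and 198 viruses. IRESbase includes only the shortest functionally validated IRES elements, and the median length of the IRES elements in the database is 174 nt. In addition, based on genomic positional relationships, we found that 2191 circRNAs and 168 lncRNAs contain an IRES element or part of one. IRESbase is available at http://reprod.njmu.edu.cn/cgi-bin/iresbase/index.php.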

Page 129-139


Database

MosaicBase: A Knowledgebase of Postzygotic Mosaic Variants in Noncancer Disease-related and Healthy Human Individuals

Xiaoxu Yang, Changhong Yang, Xianing Zheng, Luoxing Xiong, Yutian Tao, Meng Wang, Adam Yongxin Ye, Qixi Wu, Yanmei Dou, Junyu Luo, Liping Wei, August Yue Huang

Mosaic variants resulting from postzygotic mutations are prevalent in the human genome and play important roles in human diseases. However, except for cancer-related variants, there is no collection of postzygotic mosaic variants in noncancer disease-related and healthy individuals. Here, we present MosaicBase, a comprehensive database that includes 6698 mosaic variants related to 266 noncancer diseases and 27,991 mosaic variants identified in 422 healthy individuals. Genomic and phenotypic information of each variant was manually extracted and curated from 383 publications. MosaicBase supports the query of variants with Online Mendelian Inheritance in Man (OMIM) entries, genomic coordinates, gene symbols, or Entrez IDs. We also provide an integrated genome browser for users to easily access mosaic variants and their related annotations for any genomic region. By analyzing the variants collected in MosaicBase, we find that mosaic variants that directly contribute to disease phenotype show features distinct from those of variants in individuals with mild or no phenotypes, in terms of their genomic distribution, mutation signatures, and fraction of mutant cells. MosaicBase will not only assist clinicians in genetic counseling and diagnosis but also provide a useful resource to understand the genomic baseline of postzygotic mutations in the general human population. MosaicBase is publicly available at http://mosaicbase.com/ or http://49.4.21.8:8000.
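Mosaicism caused by postzygotic mutations is widespread in the human genome and plays an important role in human disease. Existing mosaic variant databases cover only cancer-related variants, and no database resource is available for mosaic variants in noncancer diseases or for those carried by healthy individuals. Here we release MosaicBase, which comprehensively collects 6698 published mosaic variants related to 266 noncancer diseases and 27,991 mosaic variants reported in 422 healthy individuals, and provides rich annotation. The database also includes genotype and phenotype information for each variant carrier, manually curated from 383 publications on mosaic variants. MosaicBase currently supports searching by Online Mendelian Inheritance in Man (OMIM) disease ID, genomic coordinates, gene symbol, and Entrez ID. MosaicBase also has a built-in, user-customizable genome browser for visualizing all mosaic variants in any genomic region. By analyzing all mosaic variants collected in MosaicBase, we found that, compared with mosaic variants detected in healthy individuals or in carriers with mild phenotypes, mosaic variants that directly cause a full disease phenotype show clearly different genomic distributions, base substitution signatures, and mutant allele fractions. MosaicBase not only helps clinicians with genetic counseling and diagnosis but also provides a baseline resource for studying mosaic variants in the healthy population. MosaicBase can be accessed at http://mosaicbase.com/ or http://49.4.21.8:8000.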

Page 140-149


Database

iGMDR: Integrated Pharmacogenetic Resource Guide to Cancer Therapy and Research

Xiang Chen, Yi Guo, Xin Chen

Current pharmacogenetic studies have obtained many genetic models that can predict the therapeutic efficacy of anticancer drugs. Although some of these models are of crucial importance and have been used in clinical practice, these very valuable models have not been well adopted into cancer research to promote the development of cancer therapies due to the lack of integration and standards for the existing data of the pharmacogenetic studies. For this purpose, we built a resource investigating genetic model of drug response (iGMDR), which integrates the models from in vitro and in vivo pharmacogenetic studies with different omics data from a variety of technical systems. In this study, we introduced a standardized process for all integrations, and described how users can utilize these models to gain insights into cancer. iGMDR is freely accessible at https://igmdr.modellab.cn.
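In cancer, analysis of genetic variation is most commonly used to indicate the reliability and efficacy of drug treatment. Current pharmacogenomic studies, both in vivo and in vitro, have already obtained much genetic information related to anticancer drug therapy. The ultimate goal of screening this information is to measure the body's sensitivity to anticancer drugs and to obtain new genetic models that efficiently predict therapeutic efficacy. Although many genetic models have been used in clinical trials and clinical practice for cancer treatment, these very valuable models have not been well used to promote cancer research in turn, because existing pharmacogenetic data lack integration and standards. A new online resource is therefore needed to integrate and analyze these genetic models, promote the effective use of this information, and release its value. To this end, we developed iGMDR, which integrates resources from in vitro and in vivo pharmacogenomic studies across different technical systems. In the article, we describe how pharmacogenomic models from different systems are integrated and standardized, and how users can use this information to improve their understanding of cancer therapy. In case studies, we used the integrated model data to design a new clinical sequencing panel and a drug combination therapy strategy. In addition, we analyzed tissue-specific drug sensitivity at the level of model data. iGMDR provides a unique resource for mining associations between anticancer drugs and personal genomes, with the aim of discovering new cancer knowledge through big data. iGMDR is available at https://igmdr.modellab.cn.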

Page 150-160


Database

IC4R-2.0: Rice Genome Reannotation Using Massive RNA-seq Data

Jian Sang, Dong Zou, Zhennan Wang, Fan Wang, Yuansheng Zhang, Lin Xia, Zhaohua Li, Lina Ma, Mengwei Li, Bingxiang Xu, Xiaonan Liu, Shuangyang Wu, Lin Liu, Guangyi Niu, Man Li, Yingfeng Luo, Songnian Hu, Lili Hao, Zhang Zhang

Genome reannotation aims for complete and accurate characterization of gene models and thus is of critical significance for in-depth exploration of gene function. Although the availability of massive RNA-seq data provides great opportunities for gene model refinement, few efforts have been made to adopt these precious data in rice genome reannotation. Here we reannotate the rice (Oryza sativa L. ssp. japonica) genome based on integration of large-scale RNA-seq data and release a new annotation system IC4R-2.0. In general, IC4R-2.0 significantly improves the completeness of gene structure, identifies a number of novel genes, and integrates a variety of functional annotations. Furthermore, long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) are systematically characterized in the rice genome. Performance evaluation shows that compared to previous annotation systems, IC4R-2.0 achieves higher integrity and quality, primarily attributable to massive RNA-seq data applied in genome annotation. Consequently, we incorporate the improved annotations into the Information Commons for Rice (IC4R), a database integrating multiple omics data of rice, and accordingly update IC4R by providing more user-friendly web interfaces and implementing a series of practical online tools. Together, the updated IC4R, which is equipped with the improved annotations, bears great promise for comparative and functional genomic studies in rice and other monocotyledonous species. The IC4R-2.0 annotation system and related resources are freely accessible at http://ic4r.org/.
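Genome reannotation is the process of continually refining gene models, and it is of great significance for in-depth analysis of functional genes in model and non-model organisms. Because transcriptome sequencing can effectively identify alternative splicing sites in the genome and sensitively detect lowly expressed and tissue-specific genes, it has great potential in genome reannotation. Given the massive transcriptome sequencing data accumulated for rice, we developed a genome annotation pipeline based on large-scale integrative analysis of public RNA-Seq data, reannotated the rice genome, and obtained a new rice genome annotation system, IC4R-2.0. The results show that IC4R-2.0 updates the structures of protein-coding genes in the original annotation system through correction of exon/intron regions, identification of new UTR regions, gene fusion, and discovery of novel genes. We also identified long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) in the rice genome. By integrating resources from multiple genome functional annotation platforms, we provide richer functional annotation for rice genes. Quantitative evaluation and comparative analysis of different versions of rice genome annotation systems show that large-scale integration of transcriptome sequencing data does improve the completeness and annotation quality of rice gene models. To make the reannotation information easy to access, we redesigned and further developed the rice bioinformatics portal IC4R (v 1.0), which now integrates the rice genome reannotation information, offers friendlier data display interfaces, improves data retrieval efficiency, and provides a series of practical online analysis tools. This study provides a data foundation for large-scale gene function analysis and related work in rice and other monocotyledonous plants. The IC4R-2.0 annotation system and related resources are available at http://ic4r.org/.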

Page 161-172


Database

SR4R: An Integrative SNP Resource for Genomic Breeding and Population Research in Rice

Jun Yan, Dong Zou, Chen Li, Zhang Zhang, Shuhui Song, Xiangfeng Wang

The Information Commons for Rice (IC4R) database is a collection of 18 million single nucleotide polymorphisms (SNPs) identified by resequencing of 5152 rice accessions. Although IC4R offers an ultra-high-density rice variation map, these raw SNPs are not readily usable for the public. To satisfy different research utilizations of SNPs for population genetics, evolutionary analysis, association studies, and genomic breeding in rice, raw genotypic data of these 18 million SNPs were processed by unified bioinformatics pipelines. The outcomes were used to develop a daughter database of IC4R – SnpReady for Rice (SR4R). SR4R presents four reference SNP panels, including 2,097,405 hapmapSNPs after data filtration and genotype imputation, 156,502 tagSNPs selected from linkage disequilibrium-based redundancy removal, 1180 fixedSNPs selected from genes exhibiting selective sweep signatures, and 38 barcodeSNPs selected from DNA fingerprinting simulation. SR4R thus offers a highly efficient rice variation map that combines reduced SNP redundancy with extensive data describing the genetic diversity of rice populations. In addition, SR4R provides rice researchers with a web interface that enables them to browse all four SNP panels, use online toolkits, as well as retrieve the original data and scripts for a variety of population genetics analyses on local computers. SR4R is freely available to academic users at http://sr4r.ic4r.org/.
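The IC4R (Information Commons for Rice) database has collected 18 million single nucleotide polymorphisms (SNPs) from 5152 rice accessions. Such a large-scale raw genomic variation map contains many missing genotypes and redundant sites and cannot be used directly in the different areas of crop genetics and breeding research. To improve the quality, usability, and generality of the variation map, a tiered genomic variation map composed of high-quality SNP sites at different densities is needed. To meet different research purposes in rice population genetics, evolutionary analysis, association analysis, and genomic selection breeding, we processed the raw genotype data of the 18 million SNPs with unified bioinformatics pipelines and used the results to build SR4R (SNP Ready for Rice), a daughter database of IC4R. SR4R provides four tiered genomic variation maps: 2,097,405 hapmapSNPs obtained after data filtration and genotype imputation, 156,502 tagSNPs selected after linkage disequilibrium-based redundancy removal, 1180 fixedSNPs obtained from selective sweep scans, and 38 barcodeSNPs selected by DNA fingerprinting simulation. SR4R not only supports query and download of genotype information for these maps but also provides 18 small programs for local analysis and two online machine learning-based tools for rice subpopulation classification and variety prediction. SR4R helps advance rice genetics and breeding research and is available at http://sr4r.ic4r.org/.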

Page 173-185


Database

BGVD: An Integrated Database for Bovine Sequencing Variations and Selective Signatures

Ningbo Chen, Weiwei Fu, Jianbang Zhao, Jiafei Shen, Qiuming Chen, Zhuqing Zheng, Hong Chen, Tad S. Sonstegard, Chuzhao Lei, Yu Jiang

Next-generation sequencing has yielded a vast amount of cattle genomic data for global characterization of population genetic diversity and identification of genomic regions under natural and artificial selection. However, efficient storage, querying, and visualization of such large datasets remain challenging. Here, we developed a comprehensive database, the Bovine Genome Variation Database (BGVD). It provides six main functionalities: gene search, variation search, genomic signature search, Genome Browser, alignment search tools, and the genome coordinate conversion tool. BGVD contains information on genomic variations comprising ~60.44 M SNPs, ~6.86 M indels, 76,634 CNV regions, and signatures of selective sweeps in 432 samples from modern cattle worldwide. Users can quickly retrieve distribution patterns of these variations for 54 cattle breeds through an interactive source of breed origin map, using a given gene symbol or genomic region for any of the three versions of the bovine reference genomes (ARS-UCD1.2, UMD3.1.1, and Btau 5.0.1). Signals of selection sweep are displayed as Manhattan plots and Genome Browser tracks. To further investigate and visualize the relationships between variants and signatures of selection, the Genome Browser integrates all variations, selection data, and resources, from NCBI, the UCSC Genome Browser, and Animal QTLdb. Collectively, all these features make the BGVD a useful archive for in-depth data mining and analyses of cattle biology and cattle breeding on a global scale. BGVD is publicly available at http://animal.nwsuaf.edu.cn/BosVar.
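To study the genetic diversity of cattle worldwide and identify genomic regions under selection, high-throughput sequencing has generated a large amount of cattle genome resequencing data. However, efficient storage, querying, and visualization of such large datasets remain challenging. In this study, we used resequencing data from 432 samples of 54 cattle breeds worldwide to develop the Bovine Genome Variation Database (BGVD). BGVD has six main functions: quick gene search, variation search, genomic selection signal search, a genome browser, alignment tools, and genome coordinate conversion. BGVD stores ~60.44 M SNPs, ~6.86 M indels, 76,634 CNVs, and selection signals for six major populations. By searching for a gene name or position, users can quickly retrieve the distribution patterns of genetic variation for 54 cattle breeds and the selection signals of the six populations in three reference genomes (ARS-UCD1.2, UMD3.1.1, and Btau 5.0.1). Selection signals are displayed as Manhattan plots and in the genome browser. The genome browser not only contains genomic variation and selection signal information but also integrates resources from NCBI, the UCSC Genome Browser, and Animal QTLdb. Taken together, these features make BGVD a highly practical database of bovine genomic variation for in-depth mining and analysis of cattle genome data on a global scale. BGVD is available at http://animal.nwsuaf.edu.cn/BosVar.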

Page 186-193


Method

HybridSucc: A Hybrid-learning Architecture for General and Species-specific Succinylation Site Prediction

Wanshan Ning, Haodong Xu, Peiran Jiang, Han Cheng, Wankun Deng, Yaping Guo, Yu Xue

As an important protein acylation modification, lysine succinylation (Ksucc) is involved in diverse biological processes, and participates in human tumorigenesis. Here, we collected 26,243 non-redundant known Ksucc sites from 13 species as the benchmark data set, combined 10 types of informative features, and implemented a hybrid-learning architecture by integrating deep-learning and conventional machine-learning algorithms into a single framework. We constructed a new tool named HybridSucc, which achieved area under curve (AUC) values of 0.885 and 0.952 for general and human-specific prediction of Ksucc sites, respectively. In comparison, the accuracy of HybridSucc was 17.84%–50.62% better than that of other existing tools. Using HybridSucc, we conducted a proteome-wide prediction and prioritized 370 cancer mutations that change Ksucc states of 218 important proteins, including PKM2, SHMT2, and IDH2. We not only developed a high-profile tool for predicting Ksucc sites, but also generated useful candidates for further experimental consideration. The online service of HybridSucc can be freely accessed for academic research at http://hybridsucc.biocuckoo.org/.

Page 194-207


Method

SuccSite: Incorporating Amino Acid Composition and Informative k-spaced Amino Acid Pairs to Identify Protein Succinylation Sites

Hui-Ju Kao, Van-Nui Nguyen, Kai-Yao Huang, Wen-Chi Chang, Tzong-Yi Lee

Protein succinylation is a biochemical reaction in which a succinyl group (-CO-CH2-CH2-CO-) is attached to the lysine residue of a protein molecule. Lysine succinylation plays important regulatory roles in living cells. However, studies in this field are limited by the difficulty in experimentally identifying the substrate site specificity of lysine succinylation. To facilitate this process, several tools have been proposed for the computational identification of succinylated lysine sites. In this study, we developed an approach to investigate the substrate specificity of lysine succinylated sites based on amino acid composition. Using experimentally verified lysine succinylated sites collected from public resources, the significant differences in position-specific amino acid composition between succinylated and non-succinylated sites were represented using the Two Sample Logo program. These findings enabled the adoption of an effective machine learning method, support vector machine, to train a predictive model with not only the amino acid composition, but also the composition of k-spaced amino acid pairs. After the selection of the best model using a ten-fold cross-validation approach, the selected model significantly outperformed existing tools based on an independent dataset manually extracted from published research articles. Finally, the selected model was used to develop a web-based tool, SuccSite, to aid the study of protein succinylation. Two proteins were used as case studies on the website to demonstrate the effective prediction of succinylation sites. We will regularly update SuccSite by integrating more experimental datasets. SuccSite is freely accessible at http://csb.cse.yzu.edu.tw/SuccSite/.
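
The two feature types named in the abstract, amino acid composition and the composition of k-spaced amino acid pairs, can be illustrated with a minimal sketch. This is not the SuccSite implementation: the function names, the example window, and the choice to normalize by the number of pair positions are assumptions made here for illustration only.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(window):
    """Fraction of each standard residue in a sequence window (hypothetical helper)."""
    n = len(window)
    if n == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: window.count(aa) / n for aa in AMINO_ACIDS}


def k_spaced_pair_composition(window, k):
    """Frequency of residue pairs separated by exactly k other residues.

    k = 0 gives adjacent pairs. Pairs containing non-standard symbols
    (for example a padding character) are skipped but still count toward
    the number of positions, which is an assumption of this sketch.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    positions = len(window) - k - 1
    if positions <= 0:
        return {pair: 0.0 for pair in counts}
    for i in range(positions):
        pair = window[i] + window[i + k + 1]
        if pair in counts:
            counts[pair] += 1
    return {pair: c / positions for pair, c in counts.items()}


# A toy window around a candidate lysine (K); real windows would be longer.
example_window = "AKAKA"
aac = amino_acid_composition(example_window)
pairs_k1 = k_spaced_pair_composition(example_window, 1)
```

Concatenating the 20 composition values with the 400 pair frequencies for each chosen k gives a fixed-length numeric vector that a classifier such as a support vector machine can take as input.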

Page 208-219