1. Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases
Qingyu Chen, Ramona Britto, Ivan Erill, Constance J. Jeffery, Arthur Liberzon, Michele Magrane, Jun-ichi Onami, Marc Robinson-Rechavi, Jana Sponarova, Justin Zobel, Karin Verspoor
2. DPHL: A DIA Pan-human Protein Mass Spectrometry Library for Robust Biomarker Discovery
Tiansheng Zhu, Yi Zhu, Yue Xuan, Huanhuan Gao, Xue Cai, Sander R. Piersma, Thang V. Pham, Tim Schelfhorst, Richard R.G.D. Haas, Irene V. Bijnsdorp, Rui Sun, Liang Yue, Guan Ruan, Qiushi Zhang, Mo Hu, Yue Zhou, Winan J. Van Houdt, Tessa Y.S. Le Large, Jacqueline Cloos, Anna Wojtuszkiewicz, Danijela Koppers-Lalic, Franziska Böttger, Chantal Scheepbouwer, Ruud H. Brakenhoff, Geert J.L.H. van Leenders, Jan N.M. Ijzermans, John W.M. Martens, Renske D.M. Steenbergen, Nicole C. Grieken, Sathiyamoorthy Selvarajan, Sangeeta Mantoo, Sze S. Lee, Serene J.Y. Yeow, Syed M.F. Alkaff, Nan Xiang, Yaoting Sun, Xiao Yi, Shaozheng Dai, Wei Liu, Tian Lu, Zhicheng Wu, Xiao Liang, Man Wang, Yingkuan Shao, Xi Zheng, Kailun Xu, Qin Yang, Yifan Meng, Cong Lu, Jiang Zhu, Jin'e Zheng, Bo Wang, Sai Lou, Yibei Dai, Chao Xu, Chenhuan Yu, Huazhong Ying, Tony K. Lim, Jianmin Wu, Xiaofei Gao, Zhongzhi Luan, Xiaodong Teng, Peng Wu, Shi'ang Huang, Zhihua Tao, Narayanan G. Iyer, Shuigeng Zhou, Wenguang Shao, Henry Lam, Ding Ma, Jiafu Ji, Oi L. Kon, Shu Zheng, Ruedi Aebersold, Connie R. Jimenez, Tiannan Guo
To address the increasing need for detecting and validating protein biomarkers in clinical specimens, mass spectrometry (MS)-based targeted proteomic techniques, including selected reaction monitoring (SRM), parallel reaction monitoring (PRM), and massively parallel data-independent acquisition (DIA), have been developed. For optimal performance, they require the fragment ion spectra of targeted peptides as prior knowledge. In this report, we describe an MS pipeline and spectral resource to support targeted proteomics studies of human tissue samples. To build the spectral resource, we integrated common open-source MS computational tools to assemble a freely accessible computational workflow based on Docker. We then applied the workflow to generate DPHL, a comprehensive DIA pan-human library, from 1096 data-dependent acquisition (DDA) MS raw files for 16 types of cancer samples. This extensive spectral resource was then applied to a proteomic study of 17 prostate cancer (PCa) patients. Thereafter, PRM validation was applied to a larger study of 57 PCa patients, and the differential expression of three proteins in prostate tumors was validated. As a second application, the DPHL spectral resource was applied to a study consisting of plasma samples from 19 diffuse large B cell lymphoma (DLBCL) patients and 18 healthy control subjects. Differentially expressed proteins between DLBCL patients and healthy control subjects were detected by DIA-MS and confirmed by PRM. These data demonstrate that DPHL supports DIA and PRM MS pipelines for robust protein biomarker discovery. DPHL is freely accessible at https://www.iprox.org/page/project.html?id=IPX0001400000.
3. hTFtarget: A Comprehensive Database for Regulations of Human Transcription Factors and Their Targets
Qiong Zhang, Wei Liu, Hong-Mei Zhang, Gui-Yan Xie, Ya-Ru Miao, Mengxuan Xia, An-Yuan Guo
Transcription factors (TFs) are key regulators that play crucial roles in biological processes. Identifying TF–target regulatory relationships is a key step toward revealing the functions of TFs and their regulation of gene expression. The accumulated chromatin immunoprecipitation sequencing (ChIP-seq) data provide great opportunities to discover TF–target regulations across different conditions. In this study, we constructed a database named hTFtarget, which integrates large-scale human TF target resources (7190 ChIP-seq samples of 659 TFs and high-confidence binding sites of 699 TFs) and epigenetic modification information to predict accurate TF–target regulations. hTFtarget offers the following functions for users to explore TF–target regulations: (1) browse or search general targets of a query TF across datasets; (2) browse TF–target regulations for a query TF in a specific dataset or tissue; (3) search potential TFs for a given target gene or non-coding RNA; (4) investigate co-association between TFs in cell lines; (5) explore potential co-regulations for given target genes or TFs; (6) predict candidate TF binding sites on given DNA sequences; (7) visualize ChIP-seq peaks for different TFs and conditions in a genome browser. hTFtarget provides a comprehensive, reliable, and user-friendly resource for exploring human TF–target regulations, which will be useful for a wide range of users in the TF and gene expression regulation community. hTFtarget is available at http://bioinfo.life.hust.edu.cn/hTFtarget.
4. IRESbase: A Comprehensive Database of Experimentally Validated Internal Ribosome Entry Sites
Jian Zhao, Yan Li, Cong Wang, Haotian Zhang, Hao Zhang, Bin Jiang, Xuejiang Guo, Xiaofeng Song
Internal ribosome entry sites (IRESs) are functional RNA elements that can directly recruit ribosomes to an internal position of the mRNA in a cap-independent manner to initiate translation. Recently, IRES elements have attracted much attention for their critical roles in various processes including translation initiation of a new type of RNA, circular RNA (circRNA), with no 5′ cap to support classical cap-dependent translation. Thus, an integrative data resource of IRES elements with experimental evidence will be useful for further studies. In this study, we present IRESbase, a comprehensive database of IRESs, by curating the experimentally validated functional minimal IRES elements from literature and annotating their host linear and circular RNAs. The current version of IRESbase contains 1328 IRESs, including 774 eukaryotic IRESs and 554 viral IRESs from 11 eukaryotic organisms and 198 viruses, respectively. As IRESbase collects only IRES of minimal length with functional evidence, the median length of IRESs in IRESbase is 174 nucleotides. By mapping IRESs to human circRNAs and long non-coding RNAs (lncRNAs), 2191 circRNAs and 168 lncRNAs were found to contain at least one entire or partial IRES sequence. IRESbase is available at http://reprod.njmu.edu.cn/cgi-bin/iresbase/index.php.
5. MosaicBase: A Knowledgebase of Postzygotic Mosaic Variants in Noncancer Disease-related and Healthy Human Individuals
Xiaoxu Yang, Changhong Yang, Xianing Zheng, Luoxing Xiong, Yutian Tao, Meng Wang, Adam Yongxin Ye, Qixi Wu, Yanmei Dou, Junyu Luo, Liping Wei, August Yue Huang
Mosaic variants resulting from postzygotic mutations are prevalent in the human genome and play important roles in human diseases. However, except for cancer-related variants, there is no collection of postzygotic mosaic variants in noncancer disease-related and healthy individuals. Here, we present MosaicBase, a comprehensive database that includes 6698 mosaic variants related to 266 noncancer diseases and 27,991 mosaic variants identified in 422 healthy individuals. Genomic and phenotypic information of each variant was manually extracted and curated from 383 publications. MosaicBase supports the query of variants with Online Mendelian Inheritance in Man (OMIM) entries, genomic coordinates, gene symbols, or Entrez IDs. We also provide an integrated genome browser for users to easily access mosaic variants and their related annotations for any genomic region. By analyzing the variants collected in MosaicBase, we find that mosaic variants that directly contribute to disease phenotype show features distinct from those of variants in individuals with mild or no phenotypes, in terms of their genomic distribution, mutation signatures, and fraction of mutant cells. MosaicBase will not only assist clinicians in genetic counseling and diagnosis but also provide a useful resource to understand the genomic baseline of postzygotic mutations in the general human population. MosaicBase is publicly available at http://mosaicbase.com/ or http://18.104.22.168:8000.
6. iGMDR: Integrated Pharmacogenetic Resource Guide to Cancer Therapy and Research
Xiang Chen, Yi Guo, Xin Chen
Current pharmacogenetic studies have produced many genetic models that can predict the therapeutic efficacy of anticancer drugs. Although some of these models are of crucial importance and have been used in clinical practice, they have not been widely adopted in cancer research to promote the development of cancer therapies, because the existing pharmacogenetic data lack integration and standards. To address this gap, we built iGMDR, a resource for investigating genetic models of drug response, which integrates the models from in vitro and in vivo pharmacogenetic studies with different omics data from a variety of technical systems. In this study, we introduce a standardized process for all integrations and describe how users can utilize these models to gain insights into cancer. iGMDR is freely accessible at https://igmdr.modellab.cn.
7. IC4R-2.0: Rice Genome Reannotation Using Massive RNA-seq Data
Jian Sang, Dong Zou, Zhennan Wang, Fan Wang, Yuansheng Zhang, Lin Xia, Zhaohua Li, Lina Ma, Mengwei Li, Bingxiang Xu, Xiaonan Liu, Shuangyang Wu, Lin Liu, Guangyi Niu, Man Li, Yingfeng Luo, Songnian Hu, Lili Hao, Zhang Zhang
Genome reannotation aims for complete and accurate characterization of gene models and thus is of critical significance for in-depth exploration of gene function. Although the availability of massive RNA-seq data provides great opportunities for gene model refinement, few efforts have been made to adopt these precious data in rice genome reannotation. Here we reannotate the rice (Oryza sativa L. ssp. japonica) genome based on integration of large-scale RNA-seq data and release a new annotation system IC4R-2.0. In general, IC4R-2.0 significantly improves the completeness of gene structure, identifies a number of novel genes, and integrates a variety of functional annotations. Furthermore, long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) are systematically characterized in the rice genome. Performance evaluation shows that compared to previous annotation systems, IC4R-2.0 achieves higher integrity and quality, primarily attributable to massive RNA-seq data applied in genome annotation. Consequently, we incorporate the improved annotations into the Information Commons for Rice (IC4R), a database integrating multiple omics data of rice, and accordingly update IC4R by providing more user-friendly web interfaces and implementing a series of practical online tools. Together, the updated IC4R, which is equipped with the improved annotations, bears great promise for comparative and functional genomic studies in rice and other monocotyledonous species. The IC4R-2.0 annotation system and related resources are freely accessible at http://ic4r.org/.
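Genome reannotation is the process of continually revising gene models, and it is of great significance for the in-depth analysis of functional genes in both model and non-model organisms. Because transcriptome sequencing can effectively identify alternative splicing sites in the genome and sensitively detect low-abundance and tissue-specific genes, it has great potential for application in genome reannotation. Given the massive transcriptome sequencing data already accumulated for rice, we developed a genome annotation pipeline based on large-scale integrative analysis of public RNA-Seq data and used it to reannotate the rice genome, thereby obtaining a new rice genome annotation system: IC4R-2.0. The results show that IC4R-2.0 updates the structures of protein-coding genes in the original annotation system through correction of exon/intron regions, identification of new UTR regions, gene fusion, and discovery of novel genes. We also identified long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) in the rice genome. By integrating resources from multiple genome functional annotation platforms, we provide richer functional annotation for rice genes. Quantitative evaluation and comparative analysis of different versions of rice genome annotation systems show that large-scale integration of transcriptome sequencing data does improve the completeness and annotation quality of rice gene models. To make the reannotation information easy for users to obtain, we redesigned and further developed the rice bioinformatics portal IC4R (v 1.0), which not only integrates the rice genome reannotation information but also provides a friendlier data display interface, improves data retrieval efficiency, and offers a series of practical online analysis tools. This study provides a data foundation for large-scale gene function analysis and related work in rice and other monocotyledonous plants. The IC4R-2.0 annotation system and related resources are available at http://ic4r.org/.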
8. SR4R: An Integrative SNP Resource for Genomic Breeding and Population Research in Rice
Jun Yan, Dong Zou, Chen Li, Zhang Zhang, Shuhui Song, Xiangfeng Wang
The Information Commons for Rice (IC4R) database is a collection of 18 million single nucleotide polymorphisms (SNPs) identified by resequencing of 5152 rice accessions. Although IC4R offers an ultra-high-density rice variation map, these raw SNPs are not readily usable by the public. To satisfy different research uses of SNPs for population genetics, evolutionary analysis, association studies, and genomic breeding in rice, raw genotypic data of these 18 million SNPs were processed by unified bioinformatics pipelines. The outcomes were used to develop a daughter database of IC4R, SnpReady for Rice (SR4R). SR4R presents four reference SNP panels, including 2,097,405 hapmapSNPs after data filtration and genotype imputation, 156,502 tagSNPs selected from linkage disequilibrium-based redundancy removal, 1180 fixedSNPs selected from genes exhibiting selective sweep signatures, and 38 barcodeSNPs selected from DNA fingerprinting simulation. SR4R thus offers a highly efficient rice variation map that combines reduced SNP redundancy with extensive data describing the genetic diversity of rice populations. In addition, SR4R provides rice researchers with a web interface that enables them to browse all four SNP panels, use online toolkits, and retrieve the original data and scripts for a variety of population genetics analyses on local computers. SR4R is freely available to academic users at http://sr4r.ic4r.org/.
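The IC4R (Information Commons for Rice) database collects 18 million single nucleotide polymorphisms (SNPs) from 5152 rice samples. Such a large-scale raw genomic variation map contains many missing genotypes and redundant sites and cannot be used directly in the different areas of crop genetics and breeding research. To improve the quality, usability, and generality of the genomic variation map, a tiered genomic variation map composed of high-quality SNP sites at different densities needs to be constructed. To meet the different research purposes of rice population genetics, evolutionary analysis, association analysis, and genomic selection breeding, we processed the raw genotype data comprising the 18 million SNPs with a unified bioinformatics pipeline and used the results to build the rice IC4R-SR4R (SNP Ready for Rice) sub-database. SR4R provides four tiered genomic variation maps: 2,097,405 hapmapSNPs obtained after data filtering and genotype imputation, 156,502 tagSNPs selected after linkage disequilibrium-based redundancy removal, 1180 fixedSNPs obtained from selective sweep scans, and 38 barcodeSNPs selected by DNA fingerprinting simulation. SR4R not only supports genotype query and download for these tiered variation maps, but also provides 18 small programs for local analysis and two online machine learning-based tools for rice subpopulation classification and variety prediction. SR4R helps advance rice genetics and breeding research and is available at http://sr4r.ic4r.org/.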
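The SR4R abstract describes tagSNPs as selected by linkage disequilibrium-based redundancy removal but does not give the procedure. As a rough illustration only, the sketch below shows one common way such pruning can be done: a greedy pass that drops any SNP whose squared genotype correlation (r²) with an already kept SNP reaches a threshold. The function names, the 0/1/2 genotype coding, the 0.8 threshold, and the absence of a sliding window are all assumptions made for this example, not details of the SR4R pipeline.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A monomorphic SNP carries no LD information.
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def prune_by_ld(genotypes, threshold=0.8):
    """Greedy LD pruning (hypothetical helper).

    genotypes: list of SNPs, each a list of per-sample genotype codes.
    Returns the indices of SNPs kept as tags: a SNP is kept only if its r²
    with every previously kept SNP is below the threshold.
    """
    kept = []
    for idx, snp in enumerate(genotypes):
        if all(r_squared(snp, genotypes[j]) < threshold for j in kept):
            kept.append(idx)
    return kept


# Toy data: SNP 1 duplicates SNP 0, SNP 2 is only weakly correlated with SNP 0.
toy_genotypes = [
    [0, 1, 2, 1],
    [0, 1, 2, 1],
    [2, 0, 1, 0],
]
tag_indices = prune_by_ld(toy_genotypes)
```

On real data, pruning is usually restricted to SNP pairs within a physical or SNP-count window, since comparing every kept SNP against every candidate does not scale to millions of sites.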
9. BGVD: An Integrated Database for Bovine Sequencing Variations and Selective Signatures
Ningbo Chen, Weiwei Fu, Jianbang Zhao, Jiafei Shen, Qiuming Chen, Zhuqing Zheng, Hong Chen, Tad S. Sonstegard, Chuzhao Lei, Yu Jiang
Next-generation sequencing has yielded a vast amount of cattle genomic data for global characterization of population genetic diversity and identification of genomic regions under natural and artificial selection. However, efficient storage, querying, and visualization of such large datasets remain challenging. Here, we developed a comprehensive database, the Bovine Genome Variation Database (BGVD). It provides six main functionalities: gene search, variation search, genomic signature search, Genome Browser, alignment search tools, and the genome coordinate conversion tool. BGVD contains information on genomic variations comprising ~60.44 M SNPs, ~6.86 M indels, 76,634 CNV regions, and signatures of selective sweeps in 432 samples from modern cattle worldwide. Users can quickly retrieve distribution patterns of these variations for 54 cattle breeds through an interactive source of breed origin map, using a given gene symbol or genomic region for any of the three versions of the bovine reference genomes (ARS-UCD1.2, UMD3.1.1, and Btau 5.0.1). Signals of selection sweep are displayed as Manhattan plots and Genome Browser tracks. To further investigate and visualize the relationships between variants and signatures of selection, the Genome Browser integrates all variations, selection data, and resources, from NCBI, the UCSC Genome Browser, and Animal QTLdb. Collectively, all these features make the BGVD a useful archive for in-depth data mining and analyses of cattle biology and cattle breeding on a global scale. BGVD is publicly available at http://animal.nwsuaf.edu.cn/BosVar.
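To study the genetic diversity of domestic cattle worldwide and identify genomic regions under selection, high-throughput sequencing has generated a large amount of cattle genome resequencing data. However, efficient storage, querying, and visualization of such large datasets remain challenging. In this study, we used resequencing data from 432 samples of 54 domestic cattle breeds worldwide to develop the Bovine Genome Variation Database (BGVD). BGVD includes six main functions: quick gene search, variation search, genomic selection signal search, a genome browser, alignment tools, and genome coordinate conversion. BGVD stores ~60.44 M SNPs, ~6.86 M indels, 76,634 CNVs, and selection signal information for six major populations. By searching with a gene name or position, users can quickly retrieve the distribution patterns of genetic variations in 54 cattle breeds and the selection signals of six populations across three reference genomes (ARS-UCD1.2, UMD3.1.1, and Btau 5.0.1). Selection signals are displayed as Manhattan plots and in the genome browser. The genome browser not only contains genomic variation and selection signal information but also integrates resources from NCBI, the UCSC Genome Browser, and Animal QTLdb. In summary, these features make BGVD a practical database of bovine genomic variation that can be used for in-depth mining and analysis of cattle genomic data on a global scale. BGVD is available at http://animal.nwsuaf.edu.cn/BosVar.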
10. HybridSucc: A Hybrid-learning Architecture for General and Species-specific Succinylation Site Prediction
Wanshan Ning, Haodong Xu, Peiran Jiang, Han Cheng, Wankun Deng, Yaping Guo, Yu Xue
As an important protein acylation modification, lysine succinylation (Ksucc) is involved in diverse biological processes, and participates in human tumorigenesis. Here, we collected 26,243 non-redundant known Ksucc sites from 13 species as the benchmark data set, combined 10 types of informative features, and implemented a hybrid-learning architecture by integrating deep-learning and conventional machine-learning algorithms into a single framework. We constructed a new tool named HybridSucc, which achieved area under curve (AUC) values of 0.885 and 0.952 for general and human-specific prediction of Ksucc sites, respectively. In comparison, the accuracy of HybridSucc was 17.84%–50.62% better than that of other existing tools. Using HybridSucc, we conducted a proteome-wide prediction and prioritized 370 cancer mutations that change Ksucc states of 218 important proteins, including PKM2, SHMT2, and IDH2. We not only developed a high-profile tool for predicting Ksucc sites, but also generated useful candidates for further experimental consideration. The online service of HybridSucc can be freely accessed for academic research at http://hybridsucc.biocuckoo.org/.
11. SuccSite: Incorporating Amino Acid Composition and Informative k-spaced Amino Acid Pairs to Identify Protein Succinylation Sites
Hui-Ju Kao, Van-Nui Nguyen, Kai-Yao Huang, Wen-Chi Chang, Tzong-Yi Lee
Protein succinylation is a biochemical reaction in which a succinyl group (-CO-CH2-CH2-CO-) is attached to the lysine residue of a protein molecule. Lysine succinylation plays important regulatory roles in living cells. However, studies in this field are limited by the difficulty in experimentally identifying the substrate site specificity of lysine succinylation. To facilitate this process, several tools have been proposed for the computational identification of succinylated lysine sites. In this study, we developed an approach to investigate the substrate specificity of lysine succinylated sites based on amino acid composition. Using experimentally verified lysine succinylated sites collected from public resources, the significant differences in position-specific amino acid composition between succinylated and non-succinylated sites were represented using the Two Sample Logo program. These findings enabled the adoption of an effective machine learning method, support vector machine, to train a predictive model with not only the amino acid composition, but also the composition of k-spaced amino acid pairs. After the selection of the best model using a ten-fold cross-validation approach, the selected model significantly outperformed existing tools based on an independent dataset manually extracted from published research articles. Finally, the selected model was used to develop a web-based tool, SuccSite, to aid the study of protein succinylation. Two proteins were used as case studies on the website to demonstrate the effective prediction of succinylation sites. We will regularly update SuccSite by integrating more experimental datasets. SuccSite is freely accessible at http://csb.cse.yzu.edu.tw/SuccSite/.
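The composition of k-spaced amino acid pairs mentioned in the SuccSite abstract can be illustrated with a short sketch. For a sequence window around a candidate lysine, the feature counts every ordered residue pair separated by exactly k positions and normalizes the counts to frequencies, giving a 400-dimensional vector for each k. The function name, the choice to normalize by the number of valid pairs, and the handling of non-standard characters are assumptions for this example; the abstract does not specify the window length or the values of k used.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def k_spaced_pair_composition(window, k):
    """Frequency of each ordered residue pair separated by k residues.

    window: peptide string centred on the candidate site (hypothetical input).
    k: number of residues between the two members of a pair (k = 0 means adjacent).
    Pairs containing a non-standard character (e.g. a padding symbol) are skipped.
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(window) - k - 1):
        pair = window[i] + window[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


# Toy window: with k = 1 the only pair is A_A, which maps to the first feature.
features_k1 = k_spaced_pair_composition("AKA", 1)
features_k0 = k_spaced_pair_composition("AKA", 0)
```

Vectors for several values of k can be concatenated with the plain amino acid composition to form the input of a classifier such as a support vector machine.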