Articles Online (Volume 21, Issue 2)

Database

OOCDB: A Comprehensive, Systematic, and Real-time Organs-on-a-chip Database

Jian Li, Weicheng Liang, Zaozao Chen, Xingyu Li, Pan Gu, Anna Liu, Pin Chen, Qiwei Li, Xueyin Mei, Jing Yang, Jun Liu, Lincao Jiang, Zhongze Gu

Organs-on-a-chip is a microfluidic microphysiological system that uses microfluidic technology to analyze the structure and function of living human cells at the tissue and organ levels in vitro. Organs-on-a-chip technology, as opposed to traditional two-dimensional cell culture and animal models, can more closely simulate pathologic and toxicologic interactions between different organs or tissues and reflect the collaborative response of multiple organs to drugs. Despite the fact that many organs-on-a-chip-related data have been published, none of the current databases have all of the following functions: searching, downloading, as well as analyzing data and results from the literature on organs-on-a-chip. Therefore, we created an organs-on-a-chip database (OOCDB) as a platform to integrate information about organs-on-a-chip from various sources, including literature, patents, raw data from microarray and transcriptome sequencing, several open-access datasets of organs-on-a-chip and organoids, and data generated in our laboratory. OOCDB contains dozens of sub-databases and analysis tools, and each sub-database contains various data associated with organs-on-a-chip, with the goal of providing researchers with a comprehensive, systematic, and convenient search engine. Furthermore, it offers a variety of other functions, such as mathematical modeling, three-dimensional modeling, and citation mapping, to meet the needs of researchers and promote the development of organs-on-a-chip. The OOCDB is available at http://www.organchip.cn.

Page 243-258


Database

TSNAdb v2.0: The Updated Version of Tumor-specific Neoantigen Database

Jingcheng Wu, Wenfan Chen, Yuxuan Zhou, Ying Chi, Xiansheng Hua, Jian Wu, Xun Gu, Shuqing Chen, Zhan Zhou

In recent years, neoantigens have been recognized as ideal targets for tumor immunotherapy. With the development of neoantigen-based tumor immunotherapy, comprehensive neoantigen databases are urgently needed to meet the growing demand for clinical studies. We previously built the tumor-specific neoantigen database (TSNAdb), which has attracted much attention. In this study, we provide TSNAdb v2.0, an updated version of the TSNAdb. TSNAdb v2.0 offers several new features, including (1) adopting more stringent criteria for neoantigen identification, (2) providing predicted neoantigens derived from three types of somatic mutations, and (3) collecting experimentally validated neoantigens and dividing them according to the experimental level. TSNAdb v2.0 is freely available at https://pgx.zju.edu.cn/tsnadb/.
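Research question: Tumor neoantigens are tumor-specific antigens arising from somatic mutations in tumor cells. The mutant peptides bind to major histocompatibility complex (MHC) molecules and are presented on the tumor cell surface as protein complexes, where they can be specifically recognized by T-cell receptors (TCRs), thereby activating T-cell immune responses. Because they are found only on the surface of tumor cells, they can serve as key biomarkers for distinguishing tumor cells from normal cells. In recent years, neoantigens have been regarded as ideal targets for tumor immunotherapy and have become a research hotspot. As neoantigen-based tumor immunotherapy research has flourished, a growing number of neoantigens derived from different mutation types have been discovered. Given the large volume of experimental data, a comprehensive tumor-specific neoantigen database is urgently needed to meet the growing demand of clinical research. Methods: Single nucleotide variants (SNVs), small insertions and deletions (INDELs), and the expression levels of the corresponding genes were collected from The Cancer Genome Atlas (TCGA); gene fusion information was collected from the study by Gao et al., with mutant proteins generated by STAR-Fusion; and HLA alleles of the corresponding samples were collected from The Cancer Imaging Archive (TCIA). The data were preprocessed with tumor neoantigen bioinformatics pipelines such as TSNAD v2.0, and three neoantigen prediction tools (DeepHLApan, MHCflurry, and NetMHCpan v4.0) were applied to obtain neoantigens from the three mutation sources. Experimentally validated tumor neoantigens were also collected through systematic literature searches, resulting in a comprehensive upgrade of TSNAdb. Main results: TSNAdb v2.0 predicts neoantigens under more stringent criteria and provides information on predicted tumor neoantigens from three mutation types. The database also includes information on predicted tumor neoantigens and shared tumor neoantigens, as well as experimentally validated tumor neoantigens. Specifically: (1) Predicted tumor neoantigens derived from SNVs, INDELs, and fusions are provided. The average number of neoantigens produced by SNVs (0.38) is lower than that of INDELs (1.22) and fusions (0.88). (2) Only peptide-HLA combinations that meet the thresholds of all three tools (DeepHLApan, MHCflurry, and NetMHCpan4.0) are considered potential tumor neoantigens. In total, 372,273 SNV-derived, 137,130 INDEL-derived, and 11,093 fusion-derived tumor neoantigens were obtained. (3) Information on predicted shared tumor neoantigens and experimentally validated tumor neoantigens is provided. These neoantigens are potential immunotherapy targets with broader applicability and higher confidence. Neoantigens arising from driver genes or driver sites are ideal targets for tumor immunotherapy; we linked the corresponding genes and mutations of these neoantigens to CandrisDB (http://biopharm.zju.edu.cn/candrisdb), our previously created database of cancer driver mutation sites, to provide more convincing information.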
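The summary above states that a peptide-HLA pair counts as a potential neoantigen only if it passes the thresholds of all three tools (DeepHLApan, MHCflurry, and NetMHCpan v4.0). The sketch below illustrates that kind of consensus filter. The cutoff values, the score directions, the field names, and the example peptides are assumptions made for illustration; they are not the thresholds or the data model used by TSNAdb v2.0.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cutoffs for illustration only; the summary does not give
# the actual thresholds applied in TSNAdb v2.0.
DEEPHLAPAN_MIN_SCORE = 0.5     # assumed: higher score = more likely binder
MHCFLURRY_MAX_IC50_NM = 500.0  # assumed: lower IC50 = stronger binding
NETMHCPAN_MAX_RANK_PCT = 2.0   # assumed: lower percentile rank = stronger


@dataclass(frozen=True)
class Prediction:
    """One peptide-HLA pair with the outputs of the three predictors."""
    peptide: str
    hla: str
    deephlapan_score: float
    mhcflurry_ic50_nm: float
    netmhcpan_rank_pct: float


def passes_all_tools(p: Prediction) -> bool:
    """Return True only if the pair meets every tool's (assumed) cutoff."""
    return (
        p.deephlapan_score >= DEEPHLAPAN_MIN_SCORE
        and p.mhcflurry_ic50_nm <= MHCFLURRY_MAX_IC50_NM
        and p.netmhcpan_rank_pct <= NETMHCPAN_MAX_RANK_PCT
    )


def consensus_neoantigens(preds: List[Prediction]) -> List[Prediction]:
    """Keep the pairs on which all three tools agree."""
    return [p for p in preds if passes_all_tools(p)]
```

Requiring agreement from every tool trades sensitivity for specificity, which is consistent with the "more stringent criteria" described for this release.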

Page 259-266


Database

iHypoxia: An Integrative Database of Protein Expression Dynamics in Response to Hypoxia in Animals

Ze-Xian Liu, Panqin Wang, Qingfeng Zhang, Shihua Li, Yuxin Zhang, Yutong Guo, Chongchong Jia, Tian Shao, Lin Li, Han Cheng, Zhenlong Wang

Mammals have evolved mechanisms to sense hypoxia and induce hypoxic responses. Recently, high-throughput techniques have greatly promoted global studies of protein expression changes during hypoxia and the identification of candidate genes associated with hypoxia-adaptive evolution, which have contributed to the understanding of the complex regulatory networks of hypoxia. In this study, we developed an integrated resource for the expression dynamics of proteins in response to hypoxia (iHypoxia), and this database contains 2589 expression events of 1944 proteins identified by low-throughput experiments (LTEs) and 422,553 quantitative expression events of 33,559 proteins identified by high-throughput experiments from five mammals that exhibit a response to hypoxia. Various experimental details, such as the hypoxic experimental conditions, expression patterns, and sample types, were carefully collected and integrated. Furthermore, 8788 candidate genes from diverse species inhabiting low-oxygen environments were also integrated. In addition, we conducted an orthologous search and computationally identified 394,141 proteins that may respond to hypoxia among 48 animals. An enrichment analysis of human proteins identified from LTEs shows that these proteins are enriched in certain drug targets and cancer genes. Annotation of known posttranslational modification (PTM) sites in the proteins identified by LTEs reveals that these proteins undergo extensive PTMs, particularly phosphorylation, ubiquitination, and acetylation. iHypoxia provides a convenient and user-friendly method for users to obtain hypoxia-related information of interest. We anticipate that iHypoxia, which is freely accessible at https://ihypoxia.omicsbio.info, will advance the understanding of hypoxia and serve as a valuable data resource.
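Research question: Collection and integration of protein expression dynamics data for the hypoxia response in animals, and construction of a database. Methods: Expression events of hypoxia-responsive proteins and candidate genes associated with hypoxia adaptation were mined from published literature and from databases such as HypoxiaDB and GO; hypoxia-related datasets were retrieved from the GEO database and reanalyzed; all data were integrated and de-duplicated to build iHypoxia, an open-source database of protein expression dynamics in the animal hypoxia response. Main result 1: iHypoxia contains 2589 quantitative expression events of 1944 hypoxia-related proteins identified by low-throughput experiments (LTEs) and 422,553 quantitative expression events of 33,559 hypoxia-responsive proteins identified by high-throughput experiments (HTEs); 8788 candidate genes associated with hypoxia adaptation identified by genomic analyses; and 394,141 orthologous proteins from 48 animals. Main result 2: iHypoxia provides rich annotations for hypoxia-responsive proteins, including hypoxic experimental conditions, protein expression patterns, sample types, posttranslational modifications, subcellular localization, protein–protein interactions, and drug–target relationships. Database link: http://ihypoxia.omicsbio.info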
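The abstract reports an enrichment analysis of LTE-identified human proteins among drug targets and cancer genes, but does not name the statistical test. A common choice for this kind of over-representation question is the one-sided hypergeometric test, sketched below with the standard library only. The function name and the numbers in the usage note are illustrative assumptions, not values from iHypoxia.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value, P(X >= hits).

    population: size of the background (e.g. all human proteins considered)
    annotated:  background items carrying the annotation (e.g. drug targets)
    sample:     size of the list being tested (e.g. hypoxia-related proteins)
    hits:       items in the list that carry the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass from the observed overlap up to the maximum
    # possible overlap; math.comb returns 0 when k exceeds n.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For a toy background of 10 proteins with 4 annotated, a list of 3 proteins containing 2 annotated ones gives `hypergeom_enrichment_p(10, 4, 3, 2)`, which equals 40/120. When many annotation categories are tested, the resulting p-values would normally also need a multiple-testing correction.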

Page 267-277


Database

ncFO: A Comprehensive Resource of Curated and Predicted ncRNAs Associated with Ferroptosis

Shunheng Zhou, Yu’e Huang, Jiani Xing, Xu Zhou, Sina Chen, Jiahao Chen, Lihong Wang, Wei Jiang

Ferroptosis is a form of regulated cell death driven by the accumulation of lipid hydroperoxides. Regulation of ferroptosis might be beneficial to cancer treatment. Non-coding RNAs (ncRNAs) are a class of RNA transcripts that generally cannot encode proteins and have been demonstrated to play critical roles in regulating ferroptosis. Herein, we developed ncFO, the ncRNA–ferroptosis association database, to document the manually curated and predicted ncRNAs that are associated with ferroptosis. Collectively, ncFO contains 90 experimentally verified entries, including 46 microRNAs (miRNAs), 21 long non-coding RNAs (lncRNAs), and 17 circular RNAs (circRNAs). In addition, ncFO also incorporates two online prediction tools based on the regulation and co-expression of ncRNA and ferroptosis genes. Using default parameters, we obtained 3260 predicted entries, including 598 miRNAs and 178 lncRNAs, by regulation, as well as 2,592,661 predicted entries, including 967 miRNAs and 9632 lncRNAs, by ncRNA–ferroptosis gene co-expression in more than 8000 samples across 20 cancer types. The detailed information of each entry includes ncRNA name, disease, species, tissue, target, regulation, publication time, and PubMed identifier. ncFO also provides survival analysis and differential expression analysis for ncRNAs. In summary, ncFO offers a user-friendly platform to search and predict ferroptosis-associated ncRNAs, which might facilitate research on ferroptosis and discover potential targets for cancer treatment. ncFO can be accessed at http://www.jianglab.cn/ncFO/.
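Research question: Ferroptosis is a recently discovered form of regulated cell death, and ncRNAs play important regulatory roles in it. Although many studies of ferroptosis-associated ncRNAs have been published, tools for predicting such ncRNAs are lacking. There is a pressing need to systematically collect experimentally verified ferroptosis-associated ncRNAs, to predict additional ones, and to build an easy-to-use online analysis platform. Approach: (1) Collect experimentally verified ferroptosis-associated ncRNAs through PubMed literature searches. (2) Predict candidate ferroptosis-associated ncRNAs from experimentally verified regulatory relationships between ncRNAs and ferroptosis genes. (3) Predict ferroptosis-associated ncRNAs in multiple cancer types from co-expression between ncRNAs and ferroptosis genes. Main result 1: PubMed keyword searches yielded 90 experimentally verified entries from about 200 relevant publications, including 46 microRNAs (miRNAs), 21 long non-coding RNAs (lncRNAs), and 17 circular RNAs (circRNAs). Main result 2: Prediction based on regulatory relationships between ncRNAs and ferroptosis genes yielded 3260 predicted entries, including 598 miRNAs and 178 lncRNAs. Main result 3: Prediction based on ncRNA–ferroptosis gene co-expression across 20 cancer types in the TCGA database yielded 2,592,661 entries, including 967 miRNAs and 9632 lncRNAs. Database link: http://www.jianglab.cn/ncFO/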
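One of the two prediction routes described above relies on co-expression between ncRNAs and ferroptosis genes across tumor samples. The sketch below shows the general idea using Pearson correlation and a cutoff on its absolute value. The choice of Pearson correlation, the 0.5 cutoff, and the gene and ncRNA names are assumptions for illustration; the abstract does not specify the statistic or the default parameters used by ncFO.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(
    ncrna_expr: Dict[str, List[float]],
    gene_expr: Dict[str, List[float]],
    min_abs_r: float = 0.5,  # hypothetical cutoff
) -> List[Tuple[str, str, float]]:
    """Return (ncRNA, gene, r) for every pair with |r| >= min_abs_r."""
    pairs = []
    for ncrna, x in ncrna_expr.items():
        for gene, y in gene_expr.items():
            r = pearson(x, y)
            if abs(r) >= min_abs_r:
                pairs.append((ncrna, gene, r))
    return pairs
```

Each expression profile is assumed to list the same samples in the same order. The all-pairs loop is quadratic in the number of ncRNAs and genes; with thousands of samples and transcripts, a vectorized correlation matrix would be the practical choice.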

Page 278-282


Database

RNA2Immune: A Database of Experimentally Supported Data Linking Non-coding RNA Regulation to The Immune System

Jianjian Wang, Shuang Li, Tianfeng Wang, Si Xu, Xu Wang, Xiaotong Kong, Xiaoyu Lu, Huixue Zhang, Lifang Li, Meng Feng, Shangwei Ning, Lihua Wang

Non-coding RNAs (ncRNAs), such as microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and circular RNAs (circRNAs), have emerged as important regulators of the immune system and are involved in the control of immune cell biology, disease pathogenesis, as well as vaccine responses. A repository of ncRNA–immune associations will facilitate our understanding of ncRNA-dependent mechanisms in the immune system and advance the development of therapeutics and prevention for immune disorders. Here, we describe a comprehensive database, RNA2Immune, which aims to provide a high-quality resource of experimentally supported data linking ncRNA regulatory mechanisms to immune cell function, immune disease, cancer immunology, and vaccines. The current version of RNA2Immune documents 50,433 immune–ncRNA associations in 42 host species, including (1) 6690 ncRNA associations with immune functions involving 31 immune cell types; (2) 38,672 ncRNA associations with 348 immune diseases; (3) 4833 ncRNA associations with cancer immunology; and (4) 238 ncRNA associations with vaccine responses involving 26 vaccine types targeting 22 diseases. RNA2Immune provides a user-friendly interface for browsing, searching, and downloading ncRNA–immune system associations. Collectively, RNA2Immune provides important information about how ncRNAs influence immune cell function, how dysregulation of these ncRNAs leads to pathological consequences (immune diseases and cancers), and how ncRNAs affect immune responses to vaccines. RNA2Immune is available at http://bio-bigdata.hrbmu.edu.cn/rna2immune/home.jsp.
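Research question: Systematically collect experimentally verified non-coding RNAs (ncRNAs) related to the immune system and build an ncRNA–immune association database. Background: ncRNAs participate in the differentiation, development, and function of immune cells, and play important roles in maintaining immune homeostasis, in the onset and progression of autoimmune diseases, and in host responses to vaccines. Systematic study of immune-related ncRNAs will help us better understand ncRNA functions and pathogenic mechanisms. We developed the RNA2Immune database, which provides experimentally supported ncRNA data related to immune cell function, immune diseases, cancer immunology, and vaccine responses. Main result 1: Experimentally verified associations between ncRNAs and the immune system were systematically collected to build the ncRNA–immune association database RNA2Immune. Main result 2: RNA2Immune offers a user-friendly interface for browsing, searching, and downloading ncRNA–immune association data. Database link: http://bio-bigdata.hrbmu.edu.cn/rna2immune/home.jsp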

Page 283-291


Database

CTRR-ncRNA: A Knowledgebase for Cancer Therapy Resistance and Recurrence Associated Non-coding RNAs

Tong Tang, Xingyun Liu, Rongrong Wu, Li Shen, Shumin Ren, Bairong Shen

Cancer therapy resistance and recurrence (CTRR) are the dominant causes of death in cancer patients. Recent studies have indicated that non-coding RNAs (ncRNAs) not only can reverse resistance to cancer therapy but are also crucial biomarkers for the evaluation and prediction of CTRR. Herein, we developed CTRR-ncRNA, a knowledgebase of CTRR-associated ncRNAs, aiming to provide an accurate and comprehensive resource for research involving the association between CTRR and ncRNAs. Compared to most of the existing cancer databases, CTRR-ncRNA is focused on the clinical characterization of cancers, including cancer subtypes, as well as survival outcomes and responses to personalized therapy of cancer patients. Information pertaining to biomarker ncRNAs has also been documented for the development of personalized CTRR prediction. A user-friendly interface and several functional modules have been incorporated into the database. Based on the preliminary analysis of genotype–phenotype relationships, universal ncRNAs have been found to be potential biomarkers for CTRR. CTRR-ncRNA is a translation-oriented knowledgebase that provides a valuable resource for mechanistic investigations and explainable artificial intelligence-based modeling. CTRR-ncRNA is freely available to the public at http://ctrr.bioinf.org.cn/.
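Research question: Cancer therapy resistance and recurrence (CTRR) are major causes of death in cancer patients. In recent years, a growing body of research has shown that non-coding RNAs play important roles in cancer therapy resistance and recurrence. However, it is difficult to gain a comprehensive view of the findings in this field. We therefore set out to integrate the published literature and build a knowledgebase on cancer therapy resistance and recurrence that documents the roles of non-coding RNAs in detail, with the aim of promoting a comprehensive understanding of CTRR among researchers and providing a reference for basic research and clinical practice. Methods: We manually curated and mined 3998 published articles and collected non-coding RNA data, cancer data, clinical data, and other data related to cancer therapy resistance and recurrence. We standardized the naming and annotation of the non-coding RNAs and integrated their upstream regulators and downstream targets. The result is a detailed non-coding RNA knowledgebase named CTRR-ncRNA, which is currently the most comprehensive non-coding RNA resource for cancer therapy resistance and recurrence. Main findings: The CTRR-ncRNA knowledgebase contains 367 non-coding RNAs associated with cancer therapy resistance and 46 associated with cancer recurrence. We also built a network between cancers and non-coding RNAs in the database. Our analysis showed that non-coding RNAs shared across different cancer types are more likely to be biomarkers of cancer therapy resistance. Further analysis of the therapy resistance-associated non-coding RNA network with a scale-free network model likewise showed that miRNAs and lncRNAs shared among different cancers are more likely to be biomarkers.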

Page 292-299


Database

VIS Atlas: A Database of Virus Integration Sites in Human Genome from NGS Data to Explore Integration Patterns

Ye Chen, Yuyan Wang, Ping Zhou, Hao Huang, Rui Li, Zhen Zeng, Zifeng Cui, Rui Tian, Zhuang Jin, Jiashuo Liu, Zhaoyue Huang, Lifang Li, Zheying Huang, Xun Tian, Meiying Yu, Zheng Hu

Integration of oncogenic DNA viruses into the human genome is a key step in most virus-induced carcinogenesis. Here, we constructed a virus integration site (VIS) Atlas database, an extensive collection of integration breakpoints for the three most prevalent oncoviruses, human papillomavirus, hepatitis B virus, and Epstein–Barr virus, based on next-generation sequencing (NGS) data, literature, and experimental data. There are 63,179 breakpoints and 47,411 junctional sequences with full annotations deposited in the VIS Atlas database, comprising 47 virus genotypes and 17 disease types. The VIS Atlas database provides (1) a genome browser for NGS breakpoint quality check, visualization of VISs, and the local genomic context; (2) a novel platform to discover integration patterns; and (3) a statistics interface for a comprehensive investigation of genotype-specific integration features. Data collected in the VIS Atlas help provide insights into virus pathogenic mechanisms and the development of novel antitumor drugs. The VIS Atlas database is available at http://www.vis-atlas.tech/.
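Research question: Integration of oncogenic viruses into the human genome is a key step in most virus-induced carcinogenesis. Viral integration can induce genomic instability and can also give rise to virus–human fusion transcripts, acting as a driver of tumor initiation and progression. However, existing viral integration databases provide only the positions of viral integration in the human genome; they do not provide detailed human–virus junction sequences and lack in-depth exploration of integration patterns. Sequencing resources therefore need to be integrated into a dedicated database. Methods: Detailed information on integration events of common oncogenic viruses was collected from public databases, published literature, and in-house sample and cell line sequencing data, and then reanalyzed. Integration site information, human–virus junction sequences, and integration patterns were explored and analyzed by virus genotype and by disease, and the oncogenic virus integration database VIS Atlas was built. Main result 1: The VIS Atlas database extensively collects sequencing information from multiple sources for the three most common oncogenic viruses, human papillomavirus (HPV), hepatitis B virus (HBV), and Epstein–Barr virus (EBV). The sources are (1) public databases, including TCGA, SRA, and EBI; (2) published literature; and (3) in-house NGS data. In total, 63,179 breakpoints and 47,411 fully annotated junction sequences were obtained, covering 47 virus genotypes and 17 disease types. VIS Atlas is the largest DNA virus integration database to date. Main result 2: The VIS Atlas database provides rich annotations for oncogenic virus integration sites, including integration sites by virus genotype and disease type, human–virus junction sequences, gene information near breakpoints, and hotspot genes obtained by cluster analysis. In addition, the database offers several interfaces: (1) a genome browser with integration site visualization, next-generation sequencing (NGS) breakpoint quality checks, and display of the genomic context; (2) a novel platform for exploring integration patterns; and (3) a statistics interface for comprehensive study of specific integration features. With these functions, the database is expected to serve as a valuable data resource that promotes a deeper understanding of oncogenic virus integration features and the development of novel antitumor drugs. Database link: http://www.vis-atlas.tech/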

Page 300-310


Database

PlantCADB: A Comprehensive Plant Chromatin Accessibility Database

Ke Ding, Shanwen Sun, Yang Luo, Chaoyue Long, Jingwen Zhai, Yixiao Zhai, Guohua Wang

Chromatin accessibility landscapes are essential for detecting regulatory elements, illustrating the corresponding regulatory networks, and, ultimately, understanding the molecular basis underlying key biological processes. With the advancement of sequencing technologies, a large volume of chromatin accessibility data has been accumulated and integrated for humans and other mammals. These data have greatly advanced the study of disease pathogenesis, cancer survival prognosis, and tissue development. To advance the understanding of molecular mechanisms regulating plant key traits and biological processes, we developed a comprehensive plant chromatin accessibility database (PlantCADB) from 649 samples of 37 species. These samples are abiotic stress-related (such as heat, cold, drought, and salt; 159 samples), development-related (232 samples), and/or tissue-specific (376 samples). Overall, 18,339,426 accessible chromatin regions (ACRs) were compiled. These ACRs were annotated with genomic information, associated genes, transcription factor footprint, motif, and single-nucleotide polymorphisms (SNPs). Additionally, PlantCADB provides various tools to visualize ACRs and corresponding annotations. It thus forms an integrated, annotated, and analyzed plant-related chromatin accessibility resource, which can aid in better understanding genetic regulatory networks underlying development, important traits, stress adaptations, and evolution. PlantCADB is freely available at https://bioinfor.nefu.edu.cn/PlantCADB/.

Page 311-323


Database

WheatCENet: A Database for Comparative Co-expression Networks Analysis of Allohexaploid Wheat and Its Progenitors

Zhongqiu Li, Yiheng Hu, Xuelian Ma, Lingling Da, Jiajie She, Yue Liu, Xin Yi, Yaxin Cao, Wenying Xu, Yuannian Jiao, Zhen Su

Genetic and epigenetic changes after polyploidization events could result in variable gene expression and modified regulatory networks. Here, using large-scale transcriptome data, we constructed co-expression networks for diploid, tetraploid, and hexaploid wheat species, and built a platform for comparing co-expression networks of allohexaploid wheat and its progenitors, named WheatCENet. WheatCENet is a platform for searching and comparing specific functional co-expression networks, as well as identifying the related functions of the genes clustered therein. Functional annotations like pathways, gene families, protein–protein interactions, microRNAs (miRNAs), and several lines of epigenome data are integrated into this platform, and Gene Ontology (GO) annotation, gene set enrichment analysis (GSEA), motif identification, and other useful tools are also included. Using WheatCENet, we found that the network of WHEAT ABERRANT PANICLE ORGANIZATION 1 (WAPO1) has more co-expressed genes related to spike development in hexaploid wheat than its progenitors. We also found a novel motif of CCWWWWWWGG (CArG) specifically in the promoter region of WAPO-A1, suggesting that neofunctionalization of the WAPO-A1 gene affects spikelet development in hexaploid wheat. WheatCENet is useful for investigating co-expression networks and conducting other analyses, and thus facilitates comparative and functional genomic studies in wheat. WheatCENet is freely available at http://bioinformatics.cpolar.cn/WheatCENet and http://bioinformatics.cau.edu.cn/WheatCENet.
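The abstract reports a CArG motif, written CCWWWWWWGG, in the promoter region of WAPO-A1. In IUPAC nucleotide notation W stands for A or T, so the motif can be searched with a regular expression. The sketch below is a generic motif scanner written for illustration; it is not the motif identification tool bundled with WheatCENet, and the function names are assumptions.

```python
import re
from typing import List, Tuple

# IUPAC nucleotide codes needed for simple degenerate motifs.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def iupac_to_regex(motif: str) -> str:
    """Translate an IUPAC motif such as CCWWWWWWGG into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def find_motif(sequence: str, motif: str = "CCWWWWWWGG") -> List[Tuple[int, str]]:
    """Return (0-based start, matched text) for every occurrence.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]
```

Because the reverse complement of CCWWWWWWGG is again CCWWWWWWGG, scanning a single strand is enough for this particular motif; other motifs would generally need both strands scanned.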

Page 324-336


Web Server

TIGER: A Web Portal of Tumor Immunotherapy Gene Expression Resource

Zhihang Chen, Ziwei Luo, Di Zhang, Huiqin Li, Xuefei Liu, Kaiyu Zhu, Hongwan Zhang, Zongping Wang, Penghui Zhou, Jian Ren, An Zhao, Zhixiang Zuo

Immunotherapy is a promising cancer treatment method; however, only a few patients benefit from it. The development of new immunotherapy strategies and effective biomarkers of response and resistance is urgently needed. Recently, high-throughput bulk and single-cell gene expression profiling technologies have generated valuable resources. However, these resources are not well organized and systematic analysis is difficult. Here, we present TIGER, a tumor immunotherapy gene expression resource, which contains bulk transcriptome data of 1508 tumor samples with clinical immunotherapy outcomes and 11,057 tumor/normal samples without clinical immunotherapy outcomes, as well as single-cell transcriptome data of 2,116,945 immune cells from 655 samples. TIGER provides many useful modules for analyzing collected and user-provided data. Using the resource in TIGER, we identified a tumor-enriched subset of CD4+ T cells. Patients with melanoma with a higher signature score of this subset have a significantly better response and survival under immunotherapy. We believe that TIGER will be helpful in understanding anti-tumor immunity mechanisms and discovering effective biomarkers. TIGER is freely accessible at http://tiger.canceromics.org/.
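Cancer is one of the leading causes of death worldwide. Cancer immunotherapy is a highly promising treatment, but only a minority of patients benefit from it. At present, it is not possible to accurately determine whether a patient will respond to immunotherapy, and thus to predict its efficacy and deliver precision treatment. More effective biomarkers for predicting immunotherapy efficacy, and broadly applicable immunotherapy strategies, are therefore urgently needed. In recent years, high-throughput sequencing studies related to cancer immunotherapy have multiplied, and the resulting data are a valuable resource for studying anti-tumor immunotherapy. However, a platform for integrated pan-cancer analysis of transcriptomic data related to tumor immunotherapy is still lacking. To integrate high-throughput cancer immunotherapy data and help researchers study tumor immunotherapy, we built TIGER (Tumor Immunotherapy Gene Expression Resource), a pan-cancer tumor immunotherapy database that integrates and jointly analyzes multiple types of transcriptome sequencing data related to cancer immunotherapy. Through integrated analysis of the data in TIGER, we then identified a subset of CD4+ T cells that is specifically enriched in tumor tissue; these cells specifically express high levels of genes including CXCL13, ITM2A, NR3C1, SRGN, COTL1, and PDCD1. Further analysis indicated that these cells can play a role in anti-tumor immune regulation. In immunotherapy-related transcriptome datasets, we validated the set of signature genes highly expressed in these cells and found that it differed significantly between immunotherapy responders and non-responders; patients with high expression of this signature gene set had a better prognosis. Finally, comparison with immunotherapy predictive signatures such as CD274 and CD8 showed that the signature genes of this tumor-enriched CD4+ T cell subset predict immunotherapy response more effectively. TIGER is freely accessible at http://tiger.canceromics.org/.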

Page 337-348


Web Server

NetGO 3.0: Protein Language Model Improves Large-scale Functional Annotations

Shaojun Wang, Ronghui You, Yunjia Liu, Yi Xiong, Shanfeng Zhu

As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve the performance. However, it mainly utilizes the proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved comparable performance with the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively. NetGO 3.0 is freely accessible at https://dmiip.sjtu.edu.cn/ng3.0.
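Research question: Proteins are the main carriers of life activities in organisms, and understanding protein function is of great biomedical importance. With advances in sequencing technology, the number of proteins with known sequences has increased sharply. However, because determining protein function through biochemical experiments is time-consuming and labor-intensive, fewer than 0.1% of proteins currently have experimentally validated functional annotations. Designing a high-accuracy algorithm for automated protein function prediction is therefore extremely important. Methods: NetGO 3.0 (https://dmiip.sjtu.edu.cn/ng3.0/) uses ESM-1b, a large protein language model developed by Meta, to generate feature representations of proteins that are rich in biochemical information. It also collects functional annotation data for more than 100,000 proteins and trains an independent classifier for each function label. In addition, it trains separate classifiers on protein homology, family, domain, network, and literature information, and finally combines these methods through a ranking model to predict protein function. Main results: On the standard test set, the ESM-1b-based component performed very well, which makes NetGO 3.0 the method with the highest prediction accuracy. Since their release in July 2019, the NetGO series of servers have predicted functions for more than 2 million proteins, providing a convenient service for biomedical researchers.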
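The abstract and summary above describe LR-ESM as logistic regression on ESM-1b embeddings, with an independent classifier trained for each function label. The sketch below illustrates that one-classifier-per-label setup with a small gradient-descent logistic regression written in plain Python. The two-dimensional vectors stand in for real embeddings, and the label names, learning rate, and epoch count are assumptions for illustration; this is not the NetGO 3.0 implementation.

```python
from math import exp
from typing import Dict, List, Tuple

Vector = List[float]
Model = Tuple[Vector, float]  # (weights, bias)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def score(model: Model, x: Vector) -> float:
    """Predicted probability that a protein with features x has the label."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logreg(X: List[Vector], y: List[int],
                 lr: float = 0.5, epochs: int = 500) -> Model:
    """Fit a binary logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = score((weights, bias), xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def train_per_label(X: List[Vector],
                    labels: Dict[str, List[int]]) -> Dict[str, Model]:
    """Train one independent classifier per function label."""
    return {label: train_logreg(X, y) for label, y in labels.items()}


def predict(models: Dict[str, Model], x: Vector) -> Dict[str, float]:
    """Score a new protein against every label."""
    return {label: score(model, x) for label, model in models.items()}
```

Training the labels independently keeps each classifier simple and lets the per-label scores be fed into a downstream ranking model, which is how the summary describes the component methods being combined.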

Page 349-358


Method

TransDFL: Identification of Disordered Flexible Linkers in Proteins by Transfer Learning

Yihe Pang, Bin Liu

Disordered flexible linkers (DFLs) are functional disordered regions in proteins, which are sub-regions of intrinsically disordered regions (IDRs) and play important roles in connecting domains and maintaining inter-domain interactions. Trained with the limited available DFLs, the existing DFL predictors based on machine learning techniques tend to predict ordered residues as DFLs, leading to a high false positive rate (FPR) and low prediction accuracy. Previous studies have shown that DFLs are extremely flexible disordered regions, which are usually predicted as disordered residues with high confidence [P(D) > 0.9] by an IDR predictor. Therefore, transferring an IDR predictor to an accurate DFL predictor is of great significance for understanding the functions of IDRs. In this study, we proposed a new predictor called TransDFL for identifying DFLs by transferring the RFPR-IDP predictor for IDR identification to the DFL prediction. The RFPR-IDP was pre-trained with IDR sequences to learn the general features between IDRs and DFLs, which is helpful to reduce the false positives in the ordered regions. RFPR-IDP was fine-tuned with the DFL sequences to capture the specific features of DFLs so as to be transferred into the TransDFL. Experimental results of two application scenarios (prediction of DFLs only in IDRs or prediction of DFLs in entire proteins) showed that TransDFL consistently outperformed other existing DFL predictors with higher accuracy. The corresponding web server of TransDFL can be freely accessed at http://bliulab.net/TransDFL/.
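Research question: Identification of disordered flexible linkers in proteins. Methods: A sequence-based computational method for predicting disordered flexible linkers is proposed. Disordered flexible linkers are highly flexible disordered segments in protein sequences and are usually identified by intrinsic disorder predictors as high-confidence disordered regions. Protein structure and function are both encoded in the amino acid sequence, much as the source and target languages in machine translation express the same semantics. In machine translation, transfer learning allows a translation model pre-trained on large-scale source-language data to be transferred to translation into the target language. Protein sequences resemble natural language sentences: amino acids are joined by peptide bonds into protein sequences with specific structures and functions, just as words are combined by grammatical rules into sentences with specific meanings. The theories and ideas of natural language processing can therefore be used to analyze protein sequences and decipher the semantics of the "book of life". Inspired by this similarity and guided by the properties of disordered flexible linkers, the transfer learning approach from machine translation was used to transfer the intrinsic disorder predictor RFPR-IDP and build TransDFL, a method for predicting disordered flexible linkers. Main result 1: A model architecture combining bidirectional long short-term memory and convolutional neural networks encodes the local and long-range contextual semantic information of protein sequences. Main result 2: Pre-training and transfer of the disordered region prediction model captures the features common to disordered flexible linkers and disordered regions, reduces false positive predictions in ordered regions, and significantly lowers the model's misclassification rate. Main result 3: The effectiveness of TransDFL was validated in two scenarios: prediction within disordered regions only and prediction over entire protein sequences. Main result 4: A user-friendly online prediction service is provided at http://bliulab.net/TransDFL/. Data links: The open-source code is available from the National Genomics Data Center, China National Center for Bioinformation, at https://ngdc.cncb.ac.cn/biocode/tools/BT007312. Experimental data can be downloaded from http://bliulab.net/TransDFL/benchmark/.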
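The abstract notes that DFLs are usually predicted as disordered with high confidence, P(D) > 0.9, by an IDR predictor. The sketch below shows how per-residue disorder probabilities could be turned into candidate high-confidence segments using that cutoff. It only illustrates the observation that motivates the transfer; it is not the TransDFL model, and the function name, the `min_len` parameter, and the example scores are assumptions.

```python
from typing import List, Sequence, Tuple


def high_confidence_segments(
    scores: Sequence[float],
    threshold: float = 0.9,  # P(D) cutoff quoted in the abstract
    min_len: int = 1,        # hypothetical minimum segment length
) -> List[Tuple[int, int]]:
    """Return 0-based inclusive (start, end) runs where P(D) > threshold."""
    segments = []
    start = None
    for i, p in enumerate(scores):
        if p > threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores) - 1))
    return segments
```

For example, the per-residue scores `[0.2, 0.95, 0.97, 0.5, 0.91, 0.92, 0.93]` contain two runs above the cutoff, at positions 1 to 2 and 4 to 6. A learned model is still needed to decide which of these highly disordered stretches actually act as linkers, which is the gap the transfer from RFPR-IDP is meant to close.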

Page 359-369


Method

deCS: A Tool for Systematic Cell Type Annotations of Single-cell RNA Sequencing Data among Human Tissues

Guangsheng Pei, Fangfang Yan, Lukas M. Simon, Yulin Dai, Peilin Jia, Zhongming Zhao

Single-cell RNA sequencing (scRNA-seq) is revolutionizing the study of complex and dynamic cellular mechanisms. However, cell type annotation remains a main challenge as it largely relies on a priori knowledge and manual curation, which is cumbersome and subjective. The increasing number of scRNA-seq datasets, as well as numerous published genetic studies, has motivated us to build a comprehensive human cell type reference atlas. Here, we present decoding Cell type Specificity (deCS), an automatic cell type annotation method augmented by a comprehensive collection of human cell type expression profiles and marker genes. We used deCS to annotate scRNA-seq data from various tissue types and systematically evaluated the annotation accuracy under different conditions, including reference panels, sequencing depth, and feature selection strategies. Our results demonstrate that expanding the references is critical for improving annotation accuracy. Compared to many existing state-of-the-art annotation tools, deCS significantly reduced computation time and increased accuracy. deCS can be integrated into the standard scRNA-seq analytical pipeline to enhance cell type annotation. Finally, we demonstrated the broad utility of deCS to identify trait–cell type associations in 51 human complex traits, providing deep insights into the cellular mechanisms underlying disease pathogenesis. All documents for deCS, including source code, user manual, demo data, and tutorials, are freely available at https://github.com/bsml320/deCS.

Page 370-384


Method

RegVar: Tissue-specific Prioritization of Non-coding Regulatory Variants

Hao Lu, Luyu Ma, Cheng Quan, Lei Li, Yiming Lu, Gangqiao Zhou, Chenggang Zhang

Non-coding genomic variants constitute the majority of trait-associated genome variations; however, the identification of functional non-coding variants is still a challenge in human genetics, and a method for systematically assessing the impact of regulatory variants on gene expression and linking these regulatory variants to potential target genes is still lacking. Here, we introduce a deep neural network (DNN)-based computational framework, RegVar, which can accurately predict the tissue-specific impact of non-coding regulatory variants on target genes. We show that by robustly learning the genomic characteristics of massive variant–gene expression associations in a variety of human tissues, RegVar vastly surpasses all current non-coding variant prioritization methods in predicting regulatory variants under different circumstances. The unique features of RegVar make it an excellent framework for assessing the regulatory impact of any variant on its putative target genes in a variety of tissues. RegVar is available as a web server at https://regvar.omic.tech/.
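RegVar uses a deep neural network (DNN) framework and tissue-specific expression quantitative trait locus (eQTL) data from the Genotype-Tissue Expression (GTEx) project, combining sequence, epigenomic, and evolutionary conservation features of variants and the target genes they regulate, to build tissue-specific prediction models of non-coding regulatory variants in 17 human tissues. Compared with previous methods, RegVar shows better predictive performance in identifying regulatory variants (Figure 2). To assess applicability, the researchers used RegVar to annotate regulatory probabilities for all single nucleotide variant sites on chromosome 22. The results show a large number of variants with a high probability of regulatory function that may affect the expression of specific target genes. These sites are not successfully detected in real eQTL studies, possibly because their regulatory effects are very weak, or because of limiting factors such as sample size and statistical power. The researchers then used the RegVar model for tissue-specific prediction on variant sites randomly selected across the genome and identified cross-tissue and tissue-specific regulatory variants. Epigenetic annotation showed that cross-tissue regulatory variants tend to carry promoter epigenetic marks in multiple tissues, whereas tissue-specific regulatory variants mostly carry tissue-specific enhancer epigenetic marks (Figure 3). To further explore the extensibility of the RegVar framework, the researchers used pathogenic variants from the Human Gene Mutation Database (HGMD) to build a simplified pathogenic variant prediction model. Compared with published methods of the same kind, RegVar achieves comparable predictive performance. RegVar is provided both as an online web application (https://regvar.omic.tech/) and as a downloadable model package for researchers in related fields. RegVar is expected to be useful for screening candidate variants and identifying target genes, helping to reveal the complex regulatory relationships in the genome and to clarify the molecular basis of complex traits.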

Page 385-395


Method

inMTSCCA: An Integrated Multi-task Sparse Canonical Correlation Analysis for Multi-omic Brain Imaging Genetics

Lei Du, Jin Zhang, Ying Zhao, Muheng Shang, Lei Guo, Junwei Han, The Alzheimer's Disease Neuroimaging Initiative

Identifying genetic risk factors for Alzheimer’s disease (AD) is an important research topic. To date, different endophenotypes, such as imaging-derived endophenotypes and proteomic expression-derived endophenotypes, have shown great value in uncovering risk genes compared to case–control studies. Biologically, a co-varying pattern of different omics-derived endophenotypes could result from the shared genetic basis. However, existing methods mainly focus on the effect of endophenotypes alone; the effect of cross-endophenotype (CEP) associations remains largely unexploited. In this study, we used both endophenotypes and their CEP associations of multi-omic data to identify genetic risk factors, and proposed two integrated multi-task sparse canonical correlation analysis (inMTSCCA) methods, i.e., pairwise endophenotype correlation-guided MTSCCA (pcMTSCCA) and high-order endophenotype correlation-guided MTSCCA (hocMTSCCA). pcMTSCCA employed pairwise correlations between magnetic resonance imaging (MRI)-derived, plasma-derived, and cerebrospinal fluid (CSF)-derived endophenotypes as an additional penalty. hocMTSCCA used high-order correlations among these multi-omic data for regularization. To identify genetic risk factors at individual and group levels, as well as altered endophenotypic markers, we introduced sparsity-inducing penalties for both models. We compared pcMTSCCA and hocMTSCCA with three related methods on both simulation and real (consisting of neuroimaging data, proteomic analytes, and genetic data) datasets. The results showed that our methods obtained better or comparable canonical correlation coefficients (CCCs) and better feature subsets than benchmarks. Most importantly, the identified genetic loci and heterogeneous endophenotypic markers showed high relevance. Therefore, jointly using multi-omic endophenotypes and their CEP associations is promising to reveal genetic risk factors.
The source code and manual of inMTSCCA are available at https://ngdc.cncb.ac.cn/biocode/tools/BT007330.

Page 396-413


Application Note

mvPPT: A Highly Efficient and Sensitive Pathogenicity Prediction Tool for Missense Variants

Shi-Yuan Tong, Ke Fan, Zai-Wei Zhou, Lin-Yun Liu, Shu-Qing Zhang, Yinghui Fu, Guang-Zhong Wang, Ying Zhu, Yong-Chun Yu

Next-generation sequencing technologies both boost the discovery of variants in the human genome and exacerbate the challenges of pathogenic variant identification. In this study, we developed Pathogenicity Prediction Tool for missense variants (mvPPT), a highly sensitive and accurate missense variant classifier based on gradient boosting. mvPPT adopts high-confidence training sets with a wide spectrum of variant profiles, and extracts three categories of features, including scores from existing prediction tools, frequencies (allele frequencies, amino acid frequencies, and genotype frequencies), and genomic context. Compared with established predictors, mvPPT achieves superior performance in all test sets, regardless of data source. In addition, our study also provides guidance for training set and feature selection strategies, as well as reveals highly relevant features, which may further provide biological insights into variant pathogenicity. mvPPT is freely available at http://www.mvppt.club/.

Page 414-426