
Articles Online (Volume 21, Issue 1)

Review Article

What Has Genomics Taught An Evolutionary Biologist?

Jianzhi Zhang

Genomics, an interdisciplinary field of biology on the structure, function, and evolution of genomes, has revolutionized many subdisciplines of life sciences, including my field of evolutionary biology, by supplying huge data, bringing high-throughput technologies, and offering a new approach to biology. In this review, I describe what I have learned from genomics and highlight the fundamental knowledge and mechanistic insights gained. I focus on three broad topics that are central to evolutionary biology and beyond—variation, interaction, and selection—and use primarily my own research and study subjects as examples. In the next decade or two, I expect that the most important contributions of genomics to evolutionary biology will be to provide genome sequences of nearly all known species on Earth, facilitate high-throughput phenotyping of natural variants and systematically constructed mutants for mapping genotype–phenotype–fitness landscapes, and assist the determination of causality in evolutionary processes using experimental evolution.

Page 1-12

Review Article

Integration of Computational Analysis and Spatial Transcriptomics in Single-cell Studies

Ran Wang, Guangdun Peng, Patrick P.L. Tam, Naihe Jing

Recent advances in single-cell transcriptomics technologies and allied computational methodologies have revolutionized molecular cell biology. Meanwhile, pioneering explorations in spatial transcriptomics have opened up avenues to address fundamental biological questions in health and disease. Here, we review the technical attributes of single-cell RNA sequencing and spatial transcriptomics, and the core concepts of computational data analysis. We further highlight the challenges in the application of data integration methodologies and the interpretation of the biological context of the findings.
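In recent years, single-cell transcriptomics has greatly advanced research in molecular cell biology. With the rapid development of single-cell transcriptomic technologies, the volume of data generated by single-cell sequencing experiments has grown exponentially. To address the massive datasets produced in single-cell studies, this review first summarizes in detail the data analysis workflow and representative analysis methods, including dimensionality reduction, cell clustering, and pseudotime analysis. Single-cell transcriptomics makes it possible to study interactions among individual cells at the whole-transcriptome level. However, the positional information of the cells is lost during single-cell sequencing. In developmental biology, and especially in early embryonic development, the position of a cell largely determines its developmental fate. To tackle this problem, researchers have carried out extensive exploratory work on spatial transcriptomics. This review summarizes pioneering spatial transcriptomic studies, including in situ spatial transcriptome analysis, array-based spatial transcriptome analysis, and laser capture microdissection-based recording of geographical location. Alongside this summary of progress in single-cell and spatial transcriptomic technologies, the review compares the strengths and weaknesses of the methods, outlines the scope of application of each, and, from the perspective of integrative data analysis, discusses the challenges these two technologies face in future biological research. It is a valuable reference for single-cell research and multidimensional data analysis, particularly for designing single-cell and spatial transcriptomic experiments.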

Page 13-23

Review Article

Computational Approaches and Challenges in Spatial Transcriptomics

Shuangsang Fang, Bichao Chen, Yong Zhang, Haixi Sun, Longqi Liu, Shiping Liu, Yuxiang Li, Xun Xu

The development of spatial transcriptomics (ST) technologies has transformed genetic research from a single-cell data level to a two-dimensional spatial coordinate system and facilitated the study of the composition and function of various cell subsets in different environments and organs. The large-scale data generated by these ST technologies, which contain spatial gene expression information, have elicited the need for spatially resolved approaches to meet the requirements of computational and biological data interpretation. These requirements include dealing with the explosive growth of data to determine the cell-level and gene-level expression, correcting the inner batch effect and loss of expression to improve the data quality, conducting efficient interpretation and in-depth knowledge mining both at the single-cell and tissue-wide levels, and conducting multi-omics integration analysis to provide an extensible framework toward the in-depth understanding of biological processes. However, algorithms designed specifically for ST technologies to meet these requirements are still in their infancy. Here, we review computational approaches to these problems in light of corresponding issues and challenges, and present forward-looking insights into algorithm development.

Page 24-47

Review Article

Computational Methods for Single-cell DNA Methylome Analysis

Waleed Iqbal, Wanding Zhou

Dissecting intercellular epigenetic differences is key to understanding tissue heterogeneity. Recent advances in single-cell DNA methylome profiling have presented opportunities to resolve this heterogeneity at the maximum resolution. While these advances enable us to explore frontiers of chromatin biology and better understand cell lineage relationships, they pose new challenges in data processing and interpretation. This review surveys the current state of computational tools developed for single-cell DNA methylome data analysis. We discuss critical components of single-cell DNA methylome data analysis, including data preprocessing, quality control, imputation, dimensionality reduction, cell clustering, supervised cell annotation, cell lineage reconstruction, gene activity scoring, and integration with transcriptome data. We also highlight unique aspects of single-cell DNA methylome data analysis and discuss how techniques common to other single-cell omics data analyses can be adapted to analyze DNA methylomes. Finally, we discuss existing challenges and opportunities for future development.

Page 48-66

Review Article

A Survey on Methods for Predicting Polyadenylation Sites from DNA Sequences, Bulk RNA-seq, and Single-cell RNA-seq

Wenbin Ye, Qiwei Lian, Congting Ye, Xiaohui Wu

Alternative polyadenylation (APA) plays important roles in modulating mRNA stability, translation, and subcellular localization, and contributes extensively to shaping eukaryotic transcriptome complexity and proteome diversity. Identification of poly(A) sites (pAs) on a genome-wide scale is a critical step toward understanding the underlying mechanism of APA-mediated gene regulation. A number of established computational tools have been proposed to predict pAs from diverse genomic data. Here we provided an exhaustive overview of computational approaches for predicting pAs from DNA sequences, bulk RNA sequencing (RNA-seq) data, and single-cell RNA sequencing (scRNA-seq) data. Particularly, we examined several representative tools using bulk RNA-seq and scRNA-seq data from peripheral blood mononuclear cells and put forward operable suggestions on how to assess the reliability of pAs predicted by different tools. We also proposed practical guidelines on choosing appropriate methods applicable to diverse scenarios. Moreover, we discussed in depth the challenges in improving the performance of pA prediction and benchmarking different methods. Additionally, we highlighted outstanding challenges and opportunities using new machine learning and integrative multi-omics techniques, and provided our perspective on how computational methodologies might evolve in the future for non-3′ untranslated region, tissue-specific, cross-species, and single-cell pA prediction.

Page 67-83

Review Article

Gut Microbiome in Colorectal Cancer: Clinical Diagnosis and Treatment

Yali Liu, Harry Cheuk-Hay Lau, Wing Yin Cheng, Jun Yu

Colorectal cancer (CRC) is one of the most frequently diagnosed cancers and the leading cause of cancer-associated deaths. Epidemiological studies have shown that both genetic and environmental risk factors contribute to the development of CRC. Several metagenomic studies of CRC have identified gut dysbiosis as a fundamental risk factor in the evolution of colorectal malignancy. Although enormous efforts have been made and substantial progress achieved in understanding the relationship between the human gut microbiome and CRC, the precise mechanisms involved remain elusive. Recent data have shown a direct causative role of the gut microbiome in DNA damage, inflammation, and drug resistance in CRC, suggesting that modulation of the gut microbiome could act as a powerful tool in CRC prevention and therapy. Here, we provide an overview of the relationship between the gut microbiome and CRC, and explore relevant mechanisms of colorectal tumorigenesis. We next highlight the potential of bacterial species as clinical biomarkers, as well as their roles in therapeutic response. Factors limiting the clinical translation of gut microbiome research and strategies for resolving current challenges are further discussed.

Page 84-96

Review Article

Application of Microbiome in Forensics

Jun Zhang, Wenli Liu, Halimureti Simayijiang, Ping Hu, Jiangwei Yan

Recent advances in next-generation sequencing technologies and improvements in bioinformatics have expanded the scope of microbiome analysis as a forensic tool. Microbiome research is concerned with the study of the compositional profile and diversity of microbial flora as well as the interactions between microbes, hosts, and the environment. It has opened up many new possibilities for forensic analysis. In this review, we discuss various applications of microbiome in forensics, including identification of individuals, geolocation inference, and post-mortem interval (PMI) estimation.

Page 97-107

Resource Review

Computational Tools and Resources for CRISPR/Cas Genome Editing

Chao Li, Wen Chu, Rafaqat Ali Gill, Shifei Sang, Yuqin Shi, Xuezhi Hu, Yuting Yang, Qamar U. Zaman, Baohong Zhang

The past decade has witnessed a rapid evolution in identifying more versatile clustered regularly interspaced short palindromic repeats (CRISPR)/CRISPR-associated protein (Cas) nucleases and their functional variants, as well as in developing precise CRISPR/Cas-derived genome editors. The programmable and robust features of the genome editors provide an effective RNA-guided platform for fundamental life science research and subsequent applications in diverse scenarios, including biomedical innovation and targeted crop improvement. One of the most essential principles is to guide alterations in genomic sequences or genes in the intended manner without undesired off-target impacts, which strongly depends on the efficiency and specificity of single guide RNA (sgRNA)-directed recognition of targeted DNA sequences. Recent advances in empirical scoring algorithms and machine learning models have facilitated sgRNA design and off-target prediction. In this review, we first briefly introduce the different features of CRISPR/Cas tools that should be taken into consideration to achieve specific purposes. Secondly, we focus on the computer-assisted tools and resources that are widely used in designing sgRNAs and analyzing CRISPR/Cas-induced on- and off-target mutations. Thirdly, we provide insights into the limitations of available computational tools that would help researchers of this field for further optimization. Lastly, we suggest a simple but effective workflow for choosing and applying web-based resources and tools for CRISPR/Cas genome editing.

Page 108-126

Original Research

The Jasmine (Jasminum sambac) Genome Provides Insight into the Biosynthesis of Flower Fragrances and Jasmonates

Gang Chen, Salma Mostafa, Zhaogeng Lu, Ran Du, Jiawen Cui, Yun Wang, Qinggang Liao, Jinkai Lu, Xinyu Mao, Bang Chang, Quan Gan, Li Wang, Zhichao Jia, Xiulian Yang, Yingfang Zhu, Jianbin Yan, Biao Jin

Jasminum sambac (jasmine flower), a world-renowned plant appreciated for its exceptional flower fragrance, is of cultural and economic importance. However, the genetic basis of its fragrance is largely unknown. Here, we present the first de novo genome assembly of J. sambac with 550.12 Mb (scaffold N50 = 40.10 Mb) assembled into 13 pseudochromosomes. Terpene synthase (TPS) genes associated with flower fragrance are considerably amplified in the form of gene clusters through tandem duplications in the genome. Gene clusters within the salicylic acid/benzoic acid/theobromine (SABATH) and benzylalcohol O-acetyltransferase/anthocyanin O-hydroxycinnamoyltransferases/anthranilate N-hydroxycinnamoyl/benzoyltransferase/deacetylvindoline 4-O-acetyltransferase (BAHD) superfamilies were identified to be related to the biosynthesis of phenylpropanoid/benzenoid compounds. Several key genes involved in jasmonate biosynthesis were duplicated, causing an increase in copy numbers. In addition, multi-omics analyses identified various aromatic compounds and many genes involved in fragrance biosynthesis pathways. Furthermore, the roles of JsTPS3 in β-ocimene biosynthesis, as well as JsAOC1 and JsAOS in jasmonic acid biosynthesis, were functionally validated. The genome assembled in this study for J. sambac offers a basic genetic resource for studying floral scent and jasmonate biosynthesis, and provides a foundation for functional genomic research and variety improvements in Jasminum.
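Research question: Through genome sequencing, assembly, and annotation of jasmine, together with multi-omics analyses of flowers at different stages, to characterize the floral volatiles released by jasmine flowers and to identify and validate important genes in the biosynthetic pathways of the main aromatic compounds of jasmine fragrance. Methods: Double-petal jasmine was used as the material. Genome size was estimated by k-mer analysis. The chromosome-level genome was deciphered by combining second- and third-generation sequencing with Hi-C chromosome conformation capture, followed by annotation of repetitive sequences and functional genes. Single-copy orthologous genes of the candidate species were identified with OrthoMCL and used to construct a phylogenetic tree revealing species origin and evolution. Divergence of the candidate species and whole-genome duplication events were studied based on synonymous substitution rates of homologous genes and synteny analysis. Fragrance biosynthesis-related genes were annotated and mined using PFAM, BLAST, and other methods. Transcriptome and metabolome analyses (widely targeted metabolome, and in vitro and in vivo volatilome) were used to measure the fragrance components of flower buds and fully opened flowers. The functions of key fragrance biosynthesis genes were validated by transgenic and in vitro enzymatic assays. Main result 1: A chromosome-level genome sequence of jasmine, 550.12 Mb in size, is reported. Jasmine diverged earlier than the other four species in Oleaceae, and a whole-genome duplication event occurred before the divergence of the common ancestor of jasmine and sweet osmanthus. Main result 2: Comparative genomic analysis showed significant expansion of genes related to the biosynthesis of fragrance-related terpenoids, phenylpropanoids, and jasmonates. Main result 3: Multi-omics analysis identified terpenoid, phenylpropanoid, and fatty acid-derived aromatic compounds, as well as genes involved in aromatic compound biosynthesis. Main result 4: The roles of JsTPS3 in β-ocimene biosynthesis, and of JsAOC1 and JsAOS in jasmonic acid biosynthesis, were functionally validated. Data availability: The genome assembly and gene annotation files are deposited in the Genome Warehouse at the National Genomics Data Center, China National Center for Bioinformation (GWH: GWHAZHY00000000); the raw genome and transcriptome sequencing data are deposited in the Genome Sequence Archive at the National Genomics Data Center (GSA: CRA008133, CRA005366, CRA005361, and CRA005359).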
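
The scaffold N50 quoted in the abstract above is a standard assembly contiguity statistic: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The sketch below illustrates that general definition on made-up lengths. The function name `n50` and the toy values are assumptions for illustration and are not taken from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    ordered = sorted(lengths, reverse=True)  # longest scaffolds first
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first scaffold that brings the running sum to
        # at least half of the total assembly length.
        if running * 2 >= total:
            return length


# Toy assembly (lengths in Mb, made up): total 20, half 10.
# Cumulative sums from the longest: 6, 11 -> N50 is 5.
print(n50([2, 3, 4, 5, 6]))
```

A larger N50 at the same total assembly size indicates that more of the genome sits in long scaffolds.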

Page 127-149

Original Research

Gut Microbiome Variation Along A Lifestyle Gradient Reveals Threats Faced by Asian Elephants

Chengbo Zhang, Zhenghan Lian, Bo Xu, Qingzhong Shen, Mingwei Bao, Zunxi Huang, Hongchen Jiang, Wenjun Li

The gut microbiome is closely related to host nutrition and health. However, the relationships between gut microorganisms and host lifestyle are not well characterized. In the absence of confounding geographic variation, we defined clear patterns of variation in the gut microbiomes of Asian elephants (AEs) in the Wild Elephant Valley, Xishuangbanna, China, along a lifestyle gradient (completely captive, semicaptive, semiwild, and completely wild). A phylogenetic analysis using the 16S rRNA gene sequences highlighted that the microbial diversity decreased as the degree of captivity increased. Furthermore, the results showed that the bacterial taxon WCHB1-41_c was substantially affected by lifestyle variations. qRT-PCR analysis revealed a paucity of genes related to butyrate production in the gut microbiome of AEs with a completely wild lifestyle, which may be due to the increased unfavorable environmental factors. Overall, these results demonstrate the distinct gut microbiome characteristics among AEs with a gradient of lifestyles and provide a basis for designing strategies to improve the well-being or conservation of this important animal species.
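Research questions: How does the gut microbiota of Asian elephants in the Wild Elephant Valley, Xishuangbanna, change along a lifestyle gradient? Which bacterial taxa in the Asian elephant gut microbiota are significantly affected by changes along this gradient? How does the abundance of gut microbial genes related to host nutrition and health change along the lifestyle gradient? Methods: Fresh fecal samples from Asian elephants were subjected to 16S rRNA gene amplicon sequencing. α-Diversity analysis showed that gut bacterial community diversity decreased significantly as the degree of captivity increased. β-Diversity analysis showed significant differences in gut microbial community composition among Asian elephants with different lifestyles, with the bacterial taxon WCHB1-41_c significantly affected by lifestyle changes. Functional prediction showed that the main functions significantly associated with lifestyle changes included metabolic pathways, biosynthesis of secondary metabolites, and biosynthesis of amino acids, which were progressively enriched from the completely captive group to the wild group. Abundance analysis of butyrate-producing bacteria and qPCR quantification of the BCoAT gene indicated that the health of completely wild Asian elephants may be at risk. Main result 1: Gut bacterial community diversity in Asian elephants decreased significantly as the degree of captivity increased. Main result 2: Gut microbial composition differed significantly among Asian elephants with different lifestyles. Main result 3: The bacterial taxa significantly affected by lifestyle changes, and their metabolic pathways, were analyzed. Main result 4: The wild lifestyle involves factors unfavorable to the health of Asian elephants. Data availability: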

Page 150-163

Original Research

Comprehensive Analysis of Ubiquitously Expressed Genes in Humans from A Data-driven Perspective

Jianlei Gu, Jiawei Dai, Hui Lu, Hongyu Zhao

Comprehensive characterization of spatial and temporal gene expression patterns in humans is critical for uncovering the regulatory codes of the human genome and understanding the molecular mechanisms of human diseases. Ubiquitously expressed genes (UEGs) refer to the genes expressed across a majority of, if not all, phenotypic and physiological conditions of an organism. It is known that many human genes are broadly expressed across tissues. However, most previous UEG studies have only focused on providing a list of UEGs without capturing their global expression patterns, thus limiting the potential use of UEG information. In this study, we proposed a novel data-driven framework to leverage the extensive collection of ∼ 40,000 human transcriptomes to derive a list of UEGs and their corresponding global expression patterns, which offers a valuable resource to further characterize human transcriptome. Our results suggest that about half (12,234; 49.01%) of the human genes are expressed in at least 80% of human transcriptomes, and the median size of the human transcriptome is 16,342 genes (65.44%). Through gene clustering, we identified a set of UEGs, named LoVarUEGs, which have stable expression across human transcriptomes and can be used as internal reference genes for expression measurement. To further demonstrate the usefulness of this resource, we evaluated the global expression patterns for 16 previously predicted disallowed genes in islet beta cells and found that seven of these genes showed relatively more varied expression patterns, suggesting that the repression of these genes may not be unique to islet beta cells.
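
The "expressed in at least 80% of human transcriptomes" criterion in the abstract above can be illustrated with a minimal sketch. This is not the authors' data-driven framework, whose details are not given here. The function name `ubiquitous_genes`, the `threshold` cutoff for calling a gene "expressed", and the toy values are all assumptions for illustration.

```python
def ubiquitous_genes(expression, min_fraction=0.8, threshold=0.0):
    """Return genes expressed in at least `min_fraction` of samples.

    expression: dict mapping gene name -> list of per-sample expression values.
    threshold: hypothetical cutoff; a value above it counts as "expressed".
    """
    selected = []
    for gene, values in expression.items():
        if not values:
            continue  # no samples for this gene, so nothing to evaluate
        n_expressed = sum(1 for v in values if v > threshold)
        if n_expressed / len(values) >= min_fraction:
            selected.append(gene)
    return sorted(selected)


# Toy data: three genes measured in five samples (values are made up).
toy = {
    "ACTB": [5.0, 3.0, 8.0, 2.0, 9.0],   # expressed in 5/5 samples
    "GAPDH": [4.0, 0.0, 6.0, 1.0, 3.0],  # expressed in 4/5 samples
    "INS": [0.0, 0.0, 7.0, 0.0, 0.0],    # expressed in 1/5 samples
}
print(ubiquitous_genes(toy))
```

With the default 80% cutoff the toy call keeps ACTB and GAPDH and drops INS. In practice the choice of expression threshold and normalization would strongly affect which genes pass.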

Page 164-176

Original Research

LDHA Desuccinylase Sirtuin 5 as A Novel Cancer Metastatic Stimulator in Aggressive Prostate Cancer

Oh Kwang Kwon, In Hyuk Bang, So Young Choi, Ju Mi Jeon, Ann-Yae Na, Yan Gao, Sam Seok Cho, Sung Hwan Ki, Youngshik Choe, Jun Nyung Lee, Yun-Sok Ha, Eun Ju Bae, Tae Gyun Kwon, Byung-Hyun Park, Sangkyu Lee

Prostate cancer (PCa) is the most commonly diagnosed genital cancer in men worldwide. Around 80% of the patients who developed advanced PCa suffered from bone metastasis, with a sharp drop in the survival rate. Despite great efforts, the detailed mechanisms underlying castration-resistant PCa (CRPC) remain unclear. Sirtuin 5 (SIRT5), an NAD+-dependent desuccinylase, is hypothesized to be a key regulator of various cancers. However, compared to other SIRTs, the role of SIRT5 in cancer has not been extensively studied. Here, we revealed significantly decreased SIRT5 levels in aggressive PCa cells relative to the PCa stages. The correlation between the decrease in the SIRT5 level and the patient’s reduced survival rate was also confirmed. Using quantitative global succinylome analysis, we characterized a significant increase in the succinylation at lysine 118 (K118su) of lactate dehydrogenase A (LDHA), which plays a role in increasing LDH activity. As a substrate of SIRT5, LDHA-K118su significantly increased the migration and invasion of PCa cells and LDH activity in PCa patients. This study reveals the reduction of SIRT5 protein expression and LDHA-K118su as a novel mechanism involved in PCa progression, which could serve as a new target to prevent CRPC progression for PCa treatment.

Page 177-189


simplifyEnrichment: A Bioconductor Package for Clustering and Visualizing Functional Enrichment Results

Zuguang Gu, Daniel Hübschmann

Functional enrichment analysis or gene set enrichment analysis is a basic bioinformatics method that evaluates the biological importance of a list of genes of interest. However, it may produce a long list of significant terms with highly redundant information that is difficult to summarize. Current tools to simplify enrichment results by clustering them into groups either still produce redundancy between clusters or do not retain consistent term similarities within clusters. We propose a new method named binary cut for clustering similarity matrices of functional terms. Through comprehensive benchmarks on both simulated and real-world datasets, we demonstrated that binary cut could efficiently cluster functional terms into groups where terms showed consistent similarities within groups and were mutually exclusive between groups. We compared binary cut clustering on the similarity matrices obtained from different similarity measures and found that semantic similarity worked well with binary cut, while similarity matrices based on gene overlap showed less consistent patterns. We implemented the binary cut algorithm in the R package simplifyEnrichment, which additionally provides functionalities for visualizing, summarizing, and comparing the clustering. The simplifyEnrichment package and the documentation are available at

Page 190-202


The First High-quality Reference Genome of Sika Deer Provides Insights into High-tannin Adaptation

Xiumei Xing, Cheng Ai, Tianjiao Wang, Yang Li, Huitao Liu, Pengfei Hu, Guiwu Wang, Huamiao Liu, Hongliang Wang, Ranran Zhang, Junjun Zheng, Xiaobo Wang, Lei Wang, Yuxiao Chang, Qian Qian, Jinghua Yu, Lixin Tang, Shigang Wu, Xiujuan Shao, Alun Li, Peng Cui, Wei Zhan, Sheng Zhao, Zhichao Wu, Xiqun Shao, Yimeng Dong, Min Rong, Yihong Tan, Xuezhe Cui, Shuzhuo Chang, Xingchao Song, Tongao Yang, Limin Sun, Yan Ju, Pei Zhao, Huanhuan Fan, Ying Liu, Xinhui Wang, Wanyun Yang, Min Yang, Tao Wei, Shanshan Song, Jiaping Xu, Zhigang Yue, Qiqi Liang, Chunyi Li, Jue Ruan, Fuhe Yang

Sika deer are known to prefer oak leaves, which are rich in tannins and toxic to most mammals; however, the genetic mechanisms underlying their unique ability to adapt to living in the jungle are still unclear. In identifying the mechanism responsible for the tolerance of a highly toxic diet, we have made a major advancement by explaining the genome of sika deer. We generated the first high-quality, chromosome-level genome assembly of sika deer and measured the correlation between tannin intake and RNA expression in 15 tissues through 180 experiments. Comparative genome analyses showed that the UGT and CYP gene families are functionally involved in the adaptation of sika deer to high-tannin food, especially the expansion of the UGT family 2 subfamily B of UGT genes. The first chromosome-level assembly and genetic characterization of the tolerance to a highly toxic diet suggest that the sika deer genome may serve as an essential resource for understanding evolutionary events and tannin adaptation. Our study provides a paradigm of comparative expressive genomics that can be applied to the study of unique biological features in non-model animals.
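Research question: Sika deer, naturally distributed in East Asia, are among the world's best-known deer species, and their velvet antler is a precious traditional Chinese medicine. Sika deer are an important species for studying deer population evolution, adaptation, and biomedicine. However, a high-quality sika deer genome sequence has been lacking. Oak leaves are rich in tannins and are toxic to most mammals, yet sika deer favor oak leaves without being poisoned. What underlies the unique ability of sika deer to adapt to life in the jungle? What metabolic and detoxification pathways does this toxic food undergo in the sika deer body? Methods: A blood sample was collected from a female sika deer in Jilin Province and sequenced at high depth, including approximately 57.7× PacBio long reads and 100.6× paired-end (PE) sequencing data. The genome was initially assembled with the wtdbg software and then assembled to the chromosome level using Hi-C sequencing. In addition, feeding experiments with different tannin contents were conducted on 12 sika deer, and RNA sequencing was performed on 15 tissues, yielding a total of 1.44 Tb of transcriptome data. Main result 1: The highest-quality sika deer whole-genome sequence known to date was assembled, with a total length of about 2.5 Gb and a chromosome-level scaffold N50 of 78.8 Mb. Main result 2: The geographic distribution of sika deer populations is highly consistent with that of oak trees, and sika deer show high tolerance to high-tannin feed (mainly oak leaves). Main result 3: The UGT gene family, especially the UGT2B genes, is significantly expanded in the sika deer genome. Main result 4: The UGT gene family is significantly highly expressed in sika deer liver, which suggests a possible molecular mechanism by which the UGT gene family participates in the metabolism of toxic substances in the sika deer diet. Data availability: Genome data link:; sequencing data links:;;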

Page 203-215


CHDbase: A Comprehensive Knowledgebase for Congenital Heart Disease-related Genes and Clinical Manifestations

Wei-Zhen Zhou, Wenke Li, Huayan Shen, Ruby W. Wang, Wen Chen, Yujing Zhang, Qingyi Zeng, Hao Wang, Meng Yuan, Ziyi Zeng, Jinhui Cui, Chuan-Yun Li, Fred Y. Ye, Zhou Zhou

Congenital heart disease (CHD) is one of the most common causes of major birth defects, with a prevalence of 1%. Although an increasing number of studies have reported the etiology of CHD, the findings scattered throughout the literature are difficult to retrieve and utilize in research and clinical practice. We therefore developed CHDbase, an evidence-based knowledgebase of CHD-related genes and clinical manifestations manually curated from 1114 publications, linking 1124 susceptibility genes and 3591 variations to more than 300 CHD types and related syndromes. Metadata such as the information of each publication and the selected population and samples, the strategy of studies, and the major findings of studies were integrated with each item of the research record. We also integrated functional annotations through parsing ∼ 50 databases/tools to facilitate the interpretation of these genes and variations in disease pathogenicity. We further prioritized the significance of these CHD-related genes with a gene interaction network approach and extracted a core CHD sub-network with 163 genes. The clear genetic landscape of CHD enables the phenotype classification based on the shared genetic origin. Overall, CHDbase provides a comprehensive and freely available resource to study CHD susceptibilities, supporting a wide range of users in the scientific and medical communities. CHDbase is accessible at
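Research question: Congenital heart disease (CHD) is one of the most common birth defects and a major cause of disability and death in infants. Although a growing number of studies have reported genetic causes of CHD, the findings are scattered across a large body of literature and are extremely difficult to retrieve and use in subsequent basic research and clinical practice. There is therefore an urgent need to systematically mine and organize the published literature, build a CHD gene-phenotype knowledgebase, and analyze the gene and phenotype spectra, so as to promote a comprehensive understanding of genetic susceptibility to CHD and provide a reference for researchers and clinicians. Methods: A total of 1114 published articles were manually reviewed and systematically mined to collect multidimensional data covering the publication, subjects, study strategy, study methods, and main conclusions. About 50 databases and tools were integrated to provide detailed functional annotation of the included genes and genetic variants, and multiple search and data presentation modes were incorporated, resulting in the release of the most comprehensive CHD gene-phenotype knowledgebase to date. Main achievement 1: CHDbase contains 1124 susceptibility genes and 3591 genetic variants for about 150 CHD types and 160 related syndromes, forming a comprehensive and open CHD gene-phenotype knowledgebase. Main achievement 2: The 1124 CHD genes were prioritized using a gene interaction network, and a core CHD network of 163 genes was extracted; the expression and functional patterns of CHD genes were also systematically analyzed. Main achievement 3: Based on gene-phenotype association data, a molecular classification of CHD was proposed for the first time, providing a reference for disease classification and etiological research. Database link: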

Page 216-227

Web Server

KinasePhos 3.0: Redesign and Expansion of the Prediction on Kinase-specific Phosphorylation Sites

Renfei Ma, Shangfu Li, Wenshuo Li, Lantian Yao, Hsien-Da Huang, Tzong-Yi Lee

The purpose of this work is to enhance KinasePhos, a machine learning-based kinase-specific phosphorylation site prediction tool. Experimentally verified kinase-specific phosphorylation data were collected from PhosphoSitePlus, UniProtKB, the GPS 5.0, and Phospho.ELM. In total, 41,421 experimentally verified kinase-specific phosphorylation sites were identified. A total of 1380 unique kinases were identified, including 753 with existing classification information from KinBase and the remaining 627 annotated by building a phylogenetic tree. Based on this kinase classification, a total of 771 predictive models were built at the individual, family, and group levels, using at least 15 experimentally verified substrate sites in positive training datasets. The improved models demonstrated their effectiveness compared with other prediction tools. For example, the prediction of sites phosphorylated by the protein kinase B, casein kinase 2, and protein kinase A families had accuracies of 94.5%, 92.5%, and 90.0%, respectively. The average prediction accuracy for all 771 models was 87.2%. For enhancing interpretability, the SHapley Additive exPlanations (SHAP) method was employed to assess feature importance. The web interface of KinasePhos 3.0 has been redesigned to provide comprehensive annotations of kinase-specific phosphorylation sites on multiple proteins. Additionally, considering the large scale of phosphoproteomic data, a downloadable prediction tool is available at or
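Research question: How can experimentally verified kinase-specific phosphorylation site data be used more comprehensively and in a more integrated way to build a user-friendly kinase-specific phosphorylation site prediction tool? Methods: 1. A total of 41,421 experimentally verified kinase-specific phosphorylation sites were collected from four databases: PhosphoSitePlus, GPS 5.0, Phospho.ELM, and UniProtKB. 2. Kinases were classified and annotated by referring to their classification annotations in KinBase, combined with phylogenetic tree construction. 3. For categories with at least 15 kinase-specific phosphorylation sites, prediction models were built using the machine learning methods SVM and XGBoost. To enhance interpretability, SHapley Additive exPlanations (SHAP) was applied to the models. 4. To meet different user needs, KinasePhos 3.0 provides a user-friendly web version and a downloadable program version. Main achievement 1: The most comprehensive set of experimentally verified kinase-specific phosphorylation site data to date was compiled. A total of 771 prediction models were built at three classification levels: individual kinase, kinase family, and kinase group. These models performed well in comparisons with other related prediction tools, with an overall average prediction accuracy of 87.2% across the 771 models. Main achievement 2: For ease of use, KinasePhos 3.0 provides a user-friendly web interface in which prediction results and model feature importance are displayed graphically. In addition, for large datasets with long prediction times, KinasePhos 3.0 has also been developed as an executable program that users can download. Data availability: The experimentally verified kinase-specific phosphorylation site data used in this study and the prediction tool can be downloaded from or ; the algorithm can be downloaded from and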

Page 228-241