Volume: 21, Issue: 3


Recent Advances in Assembly of Complex Plant Genomes

Weilong Kong, Yibin Wang, Shengcheng Zhang, Jiaxin Yu, Xingtan Zhang

Over the past 20 years, tremendous advances in sequencing technologies and computational algorithms have spurred plant genomic research into a thriving era with hundreds of genomes decoded already, ranging from those of nonvascular plants to those of flowering plants. However, complex plant genome assembly is still challenging and remains difficult to fully resolve with conventional sequencing and assembly methods due to high heterozygosity, highly repetitive sequences, or high ploidy characteristics of complex genomes. Herein, we summarize the challenges of and advances in complex plant genome assembly, including feasible experimental strategies, upgrades to sequencing technology, existing assembly methods, and different phasing algorithms. Moreover, we list actual cases of complex genome projects for readers to refer to and draw upon to solve future problems related to complex genomes. Finally, we expect that the accurate, gapless, telomere-to-telomere, and fully phased assembly of complex plant genomes could soon become routine.

Page 427-439

Original Research

A Chromosome-level Reference Genome of African Oil Palm Provides Insights into Its Divergence and Stress Adaptation

Le Wang, May Lee, Zi Yi Wan, Bin Bai, Baoqing Ye, Yuzer Alfiko, Rahmadsyah Rahmadsyah, Sigit Purwantomo, Zhuojun Song, Antonius Suwanto, Gen Hua Yue

The palm family (Arecaceae), consisting of ∼ 2600 species, is the third most economically important family of plants. The African oil palm (Elaeis guineensis) is one of the most important palms. However, the genome sequences of palms that are currently available are still limited and fragmented. Here, we report a high-quality chromosome-level reference genome of an oil palm, Dura, assembled by integrating long reads with ∼ 150× genome coverage. The assembled genome was 1.7 Gb in size, covering 94.5% of the estimated genome, of which 91.6% was assigned into 16 pseudochromosomes and 73.7% was repetitive sequences. Relying on the conserved synteny with oil palm, the existing draft genome sequences of both date palm and coconut were further assembled into chromosomal level. Transposon burst, particularly long terminal repeat retrotransposons, following the last whole-genome duplication, likely explains the genome size variation across palms. Sequence analysis of the VIRESCENS gene in palms suggests that DNA variations in this gene are related to fruit colors. Recent duplications of highly tandemly repeated pathogenesis-related proteins from the same tandem arrays play an important role in defense responses to Ganoderma. Whole-genome resequencing of both ancestral African and introduced oil palms in Southeast Asia reveals that genes under putative selection are notably associated with stress responses, suggesting adaptation to stresses in the new habitat. The genomic resources and insights gained in this study could be exploited for accelerating genetic improvement and understanding the evolution of palms.

Page 440-454

Original Research

Whole-genome Duplication Reshaped Adaptive Evolution in A Relict Plant Species, Cyclocarya paliurus

Yinquan Qu, Xulan Shang, Ziyan Zeng, Yanhao Yu, Guoliang Bian, Wenling Wang, Li Liu, Li Tian, Shengcheng Zhang, Qian Wang, Dejin Xie, Xuequn Chen, Zhenyang Liao, Yibin Wang, Jian Qin, Wanxia Yang, Caowen Sun, Xiangxiang Fu, Xingtan Zhang, Shengzuo Fang

Cyclocarya paliurus is a relict plant species that survived the last glacial period and has recently undergone a population expansion. Its leaves have traditionally been used to treat obesity and diabetes, with cyclocaric acid B as the well-known active ingredient. Here, we presented three C. paliurus genomes: two from diploids with different flower morphs and one haplotype-resolved tetraploid assembly. Comparative genomic analysis revealed two rounds of recent whole-genome duplication events and identified 691 genes with dosage effects that likely contribute to adaptive evolution through enhanced photosynthesis and increased accumulation of triterpenoids. Resequencing analysis of 45 C. paliurus individuals uncovered two bottlenecks, consistent with known events of environmental change, and many selectively swept genes involved in critical biological functions, including plant defense and secondary metabolite biosynthesis. We also proposed the biosynthesis pathway of cyclocaric acid B based on multi-omics data and identified key genes, in particular gibberellin-related genes, associated with heterodichogamy in C. paliurus. Our study sheds light on the evolutionary history of C. paliurus and provides genomic resources for studying medicinal herbs.

Page 455-469

Original Research

Haplotype-resolved Genome of Sika Deer Reveals Allele-specific Gene Expression and Chromosome Evolution

Ruobing Han, Lei Han, Xunwu Zhao, Qianghui Wang, Yanling Xia, Heping Li

Despite the scientific and medicinal importance of diploid sika deer (Cervus nippon), its genome resources are limited and haplotype-resolved chromosome-scale assembly is urgently needed. To explore mechanisms underlying the expression patterns of the allele-specific genes in antlers and the chromosome evolution in Cervidae, we report, for the first time, a high-quality haplotype-resolved chromosome-scale genome of sika deer by integrating multiple sequencing strategies, which was anchored to 32 homologous groups with a pair of sex chromosomes (XY). Several expanded genes (RET, PPP2R1A, PPP2R1B, YWHAB, YWHAZ, and RPS6) and positively selected genes (eIF4E, Wnt8A, Wnt9B, BMP4, and TP53) were identified, which could contribute to rapid antler growth without carcinogenesis. A comprehensive and systematic genome-wide analysis of allele expression patterns revealed that most alleles were functionally equivalent in regulating rapid antler growth and inhibiting oncogenesis. Comparative genomic analysis revealed that chromosome fission might occur during the divergence of sika deer and red deer (Cervus elaphus), and the olfactory sensation of sika deer might be more powerful than that of red deer. Obvious inversion regions containing olfactory receptor genes were also identified, which arose since the divergence. In conclusion, the high-quality allele-aware reference genome provides valuable resources for further illustration of the unique biological characteristics of antler, chromosome evolution, and multi-omics research of cervid animals.
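Research question: To assemble a high-quality, haplotype-resolved, chromosome-level genome of purebred sika deer, and to investigate chromosome evolution between sika deer and red deer as well as allelic expression patterns in sika deer antlers.
Methods: Illumina, PacBio, and Hi-C sequencing were combined, and multiple tools including Hifiasm, DipAsm, 3D-DNA, HapCUT2, and WhatsHap were used to assemble a haplotype-resolved, chromosome-level reference genome of sika deer. Sniffles was used for comparative genomic analysis of the sika deer and red deer genomes and to analyze the structural variations between them. HiCExplorer was used to analyze the 3D chromatin structure of the sika deer genome. Finally, HISAT2, DESeq2, and Assemblytics were combined to compare the two haplotype genomes of sika deer.
Main result 1: Two haplotype genomes of sika deer were assembled and named Hap1 and Hap2. Hap1 has a genome size of 2.71 Gb and a contig N50 of 34.98 Mb; Hap2 has a genome size of 2.57 Gb and a contig N50 of 38.09 Mb. The assembled sika deer genome comprises 32 pairs of homologous chromosomes and one pair of sex chromosomes (XY).
Main result 2: The numbers of genes, introns, and exons and the repeat content on homologous chromosomes were highly similar between the two haplotype genomes, suggesting that the two sets of homologous chromosomes in sika deer may play similar roles.
Main result 3: Several expanded genes (COL4A1, COL4A2, COL4A5, COL4A6, RET, PPP2R1A, PPP2R1B, YWHAB, YWHAZ, and RPS6) and positively selected genes (eIF4E, Wnt8A, Wnt9B, BMP4, and TP53) in sika deer may play important roles in the rapid growth and development of antler tissue without signs of carcinogenesis.
Main result 4: Chromosome 1 of sika deer shows strong collinearity with chromosomes 4 and 23 of red deer. Chromosome fission events may have accompanied the divergence of sika deer and red deer from their common ancestor, and inversions also occurred during this process.
Main result 5: The alleles showing haplotype-specific expression in the two sika deer haplotypes are involved in a variety of biological processes.
Data link: The genome sequencing data and annotation files have been deposited in the National Genomics Data Center, China National Center for Bioinformation, and are freely available at https://ngdc.cncb.ac.cn (GWH: GWHBJVV00000000 and GWHBJVU00000000; GSA: CRA007487).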

Page 470-482

Original Research

The First Crested Duck Genome Reveals Clues to Genetic Compensation and Crest Cushion Formation

Guobin Chang, Xiaoya Yuan, Qixin Guo, Hao Bai, Xiaofang Cao, Meng Liu, Zhixiu Wang, Bichun Li, Shasha Wang, Yong Jiang, Zhiquan Wang, Yang Zhang, Qi Xu, Qianqian Song, Rui Pan, Lingling Qiu, Tiantian Gu, Xinsheng Wu, Yulin Bi, Zhengfeng Cao, Yu Zhang, Yang Chen, Hong Li, Jianfeng Liu, Wangcheng Dai, Guohong Chen

The Chinese crested (CC) duck is a unique indigenous waterfowl breed, which has a crest cushion that affects its survival rate. Therefore, the CC duck is an ideal model to investigate the genetic compensation response to maintain genetic stability. In the present study, we first generated a chromosome-level genome of CC ducks. Comparative genomics revealed that genes related to tissue repair, immune function, and tumors were under strong positive selection, indicating that these adaptive changes might enhance cancer resistance and immune response to maintain the genetic stability of CC ducks. We also assembled a Chinese spot-billed (Csp-b) duck genome, and detected the structural variations (SVs) in the genome assemblies of three ducks (i.e., CC duck, Csp-b duck, and Peking duck). Functional analysis revealed that several SVs were related to the immune system of CC ducks, further strongly suggesting that genetic compensation in the anti-tumor and immune systems supports the survival of CC ducks. Moreover, we confirmed that the CC duck originated from the mallard ducks. Finally, we revealed the physiological and genetic basis of crest traits and identified a causative mutation in TAS2R40 that leads to crest formation. Overall, the findings of this study provide new insights into the role of genetic compensation in adaptive evolution.

Page 483-500

Original Research

Draft Genome of White-blotched River Stingray Provides Novel Clues for Niche Adaptation and Skeleton Formation

Jingqi Zhou, Ake Liu, Funan He, Yunbin Zhang, Libing Shen, Jun Yu, Xiang Zhang

The white-blotched river stingray (Potamotrygon leopoldi) is a cartilaginous fish native to the Xingu River, a tributary of the Amazon River system. As a rare freshwater-dwelling cartilaginous fish in the family Potamotrygonidae, for which no member has genome sequence information available, P. leopoldi can provide evolutionary details on fish phylogeny, niche adaptation, and skeleton formation. In this study, we present its draft genome of 4.11 Gb comprising 16,227 contigs and 13,238 scaffolds, with a contig N50 of 3937 kb and a scaffold N50 of 5675 kb. Our analysis shows that P. leopoldi is a slow-evolving fish that diverged from elephant sharks about 96 million years ago. Moreover, two gene families related to the immune system (immunoglobulin heavy constant delta genes and T-cell receptor alpha/delta variable genes) exhibit expansion in P. leopoldi only. We also identified the Hox gene clusters in P. leopoldi and discovered that seven Hox genes shared by five representative fish species are missing in P. leopoldi. The RNA sequencing data from P. leopoldi and three other fish species demonstrate that fishes have a more diversified tissue expression spectrum than mammals. Our functional studies suggest that the lack of the gc gene encoding vitamin D-binding protein in cartilaginous fishes (both P. leopoldi and Callorhinchus milii) could partly explain the absence of hard bone in their endoskeleton. Overall, this genome resource provides new insights into the niche adaptation, body plan, and skeleton formation of P. leopoldi, as well as the genome evolution in cartilaginous fishes.
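The contig and scaffold N50 values quoted above follow the standard definition of N50: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. The following is a minimal sketch of that definition only; the function name and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths (hypothetical), total = 20, so half = 10.
    print(n50([2, 3, 4, 5, 6]))
```

In the toy example, the two longest contigs (6 and 5) already cover 11 of 20 bases, so the N50 is 5.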

Page 501-514

Original Research

Newfound Coding Potential of Transcripts Unveils Missing Members of Human Protein Communities

Sébastien Leblanc, Marie A. Brunet, Jean-François Jacques, Amina M. Lekehal, Andréa Duclos, Alexia Tremblay, Alexis Bruggeman-Gascon, Sondos Samandi, Mylène Brunelle, Alan A. Cohen, Michelle S. Scott, Xavier Roucou

Recent proteogenomic approaches have led to the discovery that regions of the transcriptome previously annotated as non-coding regions [i.e., untranslated regions (UTRs), open reading frames overlapping annotated coding sequences in a different reading frame, and non-coding RNAs] frequently encode proteins, termed alternative proteins (altProts). This suggests that previously identified protein–protein interaction (PPI) networks are partially incomplete because altProts are not present in conventional protein databases. Here, we used the proteogenomic resource OpenProt and a combined spectrum- and peptide-centric analysis for the re-analysis of a high-throughput human network proteomics dataset, thereby revealing the presence of 261 altProts in the network. We found 19 genes encoding both an annotated (reference) and an alternative protein interacting with each other. Of the 117 altProts encoded by pseudogenes, 38 are direct interactors of reference proteins encoded by their respective parental genes. Finally, we experimentally validate several interactions involving altProts. These data improve the blueprints of the human PPI network and suggest functional roles for hundreds of altProts.

Page 515-534

Original Research

Preclinical-to-clinical Anti-cancer Drug Response Prediction and Biomarker Identification Using TINDL

David Earl Hostallero, Lixuan Wei, Liewei Wang, Junmei Cairns, Amin Emad

Prediction of the response of cancer patients to different treatments and identification of biomarkers of drug response are two major goals of individualized medicine. Here, we developed a deep learning framework called TINDL, completely trained on preclinical cancer cell lines (CCLs), to predict the response of cancer patients to different treatments. TINDL utilizes a tissue-informed normalization to account for the tissue type and cancer type of the tumors and to reduce the statistical discrepancies between CCLs and patient tumors. Moreover, by making the deep learning black box interpretable, this model identifies a small set of genes whose expression levels are predictive of drug response in the trained model, enabling identification of biomarkers of drug response. Using data from two large databases of CCLs and cancer tumors, we showed that this model can distinguish between sensitive and resistant tumors for 10 (out of 14) drugs, outperforming various other machine learning models. In addition, our small interfering RNA (siRNA) knockdown experiments on 10 genes identified by this model for one of the drugs (tamoxifen) confirmed that tamoxifen sensitivity is substantially influenced by all of these genes in MCF7 cells, and seven of these genes in T47D cells. Furthermore, genes implicated for multiple drugs pointed to shared mechanism of action among drugs and suggested several important signaling pathways. In summary, this study provides a powerful deep learning framework for prediction of drug response and identification of biomarkers of drug response in cancer. The code can be accessed at https://github.com/ddhostallero/tindl.

Page 535-550

Original Research

Morphine Re-arranges Chromatin Spatial Architecture of Primate Cortical Neurons

Liang Wang, Xiaojie Wang, Chunqi Liu, Wei Xu, Weihong Kuang, Qian Bu, Hongchun Li, Ying Zhao, Linhong Jiang, Yaxing Chen, Feng Qin, Shu Li, Qinfan Wei, Xiaocong Liu, Bin Liu, Yuanyuan Chen, Yanping Dai, Hongbo Wang, Jingwei Tian, Gang Cao, Yinglan Zhao, Xiaobo Cen

The expression of linear DNA sequence is precisely regulated by the three-dimensional (3D) architecture of chromatin. Morphine-induced aberrant gene networks of neurons have been extensively investigated; however, how morphine impacts the 3D genomic architecture of neurons is still unknown. Here, we applied digestion-ligation-only high-throughput chromosome conformation capture (DLO Hi-C) technology to investigate the effects of morphine on the 3D chromatin architecture of primate cortical neurons. After continuous morphine administration to rhesus monkeys for 90 days, we discovered that morphine re-arranged chromosome territories, with a total of 391 segmented compartments being switched. Morphine altered over half of the detected topologically associated domains (TADs), most of which exhibited a variety of shifts, followed by separating and fusing types. Analysis of the looping events at kilobase-scale resolution revealed that morphine increased not only the number but also the length of differential loops. Moreover, all identified differentially expressed genes from the RNA sequencing data were mapped to specific TAD boundaries or differential loops, and were further validated for changed expression. Collectively, an altered 3D genomic architecture of cortical neurons may regulate the gene networks associated with morphine effects. Our findings provide critical hubs connecting chromosome spatial organization and gene networks associated with morphine effects in humans.

Page 551-572

Original Research

Mapping Multi-factor-mediated Chromatin Interactions to Assess Dysregulation of Lung Cancer-related Genes

Yan Zhang, Jingwen Zhang, Wei Zhang, Mohan Wang, Shuangqi Wang, Yao Xu, Lun Zhao, Xingwang Li, Guoliang Li

Studies on the lung cancer genome are indispensable for developing a cure for lung cancer. Whole-genome resequencing, genome-wide association studies, and transcriptome sequencing have greatly improved our understanding of the cancer genome. However, dysregulation of long-range chromatin interactions in lung cancer remains poorly described. To better understand the three-dimensional (3D) genomic interaction features of the lung cancer genome, we used the A549 cell line as a model system and generated high-resolution chromatin interactions associated with RNA polymerase II (RNAPII), CCCTC-binding factor (CTCF), enhancer of zeste homolog 2 (EZH2), and histone 3 lysine 27 trimethylation (H3K27me3) using long-read chromatin interaction analysis by paired-end tag sequencing (ChIA-PET). Analysis showed that EZH2/H3K27me3-mediated interactions further repressed target genes, either through loops or domains, and their distributions along the genome were distinct from and complementary to those associated with RNAPII. Cancer-related genes were highly enriched with chromatin interactions, and chromatin interactions specific to the A549 cell line were associated with oncogenes and tumor suppressor genes, such as additional repressive interactions on FOXO4 and promoter–promoter interactions between NF1 and RNF135. Knockout of an anchor associated with chromatin interactions reversed the dysregulation of cancer-related genes, suggesting that chromatin interactions are essential for proper expression of lung cancer-related genes. These findings demonstrate the 3D landscape and gene regulatory relationships of the lung cancer genome.
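Research question: Lung cancer is among the cancer types with the highest incidence and mortality today, and in-depth, large-scale multi-omics studies of its cancer cell genome are urgently needed. Long-range three-dimensional (3D) genome interactions are a higher-order structural form of genome regulation. Is the 3D genome structure of lung cancer cells related to the occurrence and progression of lung cancer, and can the dysregulated expression of certain cancer-related genes in lung cancer be explained by changes in long-range 3D genome regulation?
Methods: Multi-factor-mediated long-read ChIA-PET experiments in lung cancer A549 cells were compared with control-cell data, supported by transcriptome and other data in a multi-omics analysis of the features and abnormalities of the 3D structure of the lung cancer genome and their effects on gene expression, as well as of the characteristics and functions of chromatin interactions mediated by repressive factors.
Main result 1: High-resolution chromatin interaction maps of RNAPII, CTCF, EZH2, and H3K27me3 were generated for lung cancer A549 cells, revealing multi-level chromatin structures.
Main result 2: Chromatin interactions mediated by repressive factors differ from RNAPII-mediated interactions in both their distribution and their gene expression regulatory functions.
Main result 3: Cancer-related genes are highly enriched at chromatin interaction sites, and chromatin interactions specific to the A549 cell line are associated with dysregulated expression of oncogenes and tumor suppressors.
Data link: https://bigd.big.ac.cn/gsa-human/browse/HRA000295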

Page 573-588

Original Research

Dynamic Spatial-temporal Expression Ratio of X Chromosome to Autosomes but Stable Dosage Compensation in Mammals

Sheng Hu Qian, Yu-Li Xiong, Lu Chen, Ying-Jie Geng, Xiao-Man Tang, Zhen-Xia Chen

In the evolutionary model of dosage compensation, per-allele expression level of the X chromosome has been proposed to have twofold up-regulation to compensate its dose reduction in males (XY) compared to females (XX). However, the expression regulation of X-linked genes is still controversial, and comprehensive evaluations are still lacking. By integrating multi-omics datasets in mammals, we investigated the expression ratios including X to autosomes (X:AA ratio) and X to orthologs (X:XX ratio) at the transcriptome, translatome, and proteome levels. We revealed a dynamic spatial-temporal X:AA ratio during development in humans and mice. Meanwhile, by tracing the evolution of orthologous gene expression in chickens, platypuses, and opossums, we found a stable expression ratio of X-linked genes in humans to their autosomal orthologs in other species (X:XX ≈ 1) across tissues and developmental stages, demonstrating stable dosage compensation in mammals. We also found that different epigenetic regulations contributed to the high tissue specificity and stage specificity of X-linked gene expression, thus affecting X:AA ratios. It could be concluded that the dynamics of X:AA ratios were attributed to the different gene contents and expression preferences of the X chromosome, rather than the stable dosage compensation.
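Research question: During the evolution of mammalian sex chromosomes, the Y chromosome gradually degenerated, leaving X-linked genes haploid in males, whereas females (XX) retain two X chromosomes. This creates a gene dosage imbalance for X-linked genes between males and females. On this basis, the evolutionary geneticist Susumu Ohno proposed the dosage compensation hypothesis: the expression of X-linked genes is doubled to compensate for the reduced X chromosome dose in males (XY), while one X chromosome is inactivated in females (XX) to avoid X tetrasomy, thereby balancing gene expression levels between the sexes. In recent years, with the development of high-throughput sequencing technologies, some studies using microarrays and RNA sequencing have reported that X-linked gene expression is up-regulated in certain mammalian tissues, bringing the expression ratio of the X chromosome to autosomes close to 1 (X:AA ~ 1) and supporting Ohno's hypothesis. However, other studies have challenged Ohno's hypothesis. These studies covered only a limited number of tissues and certain developmental stages, so their observations may not be representative and may even lead to opposite conclusions.
Methods: To comprehensively explore the pattern of dosage compensation in mammals, the researchers integrated publicly available transcriptome, translatome, and proteome data covering multiple tissues, developmental stages, and species, and systematically investigated the regulation and evolution of dosage compensation in mammals.
Main results: The researchers first found that the expression ratio of X-linked to autosomal genes (X:AA ratio) is tissue-specific, is close to 1 in the early stages of tissue development, and changes dynamically during development. By comparing the expression levels of orthologous genes between humans and other terrestrial vertebrates, they found that the human X chromosome has maintained balanced gene expression during evolution, rather than the halved expression level previously reported. Finally, integrated analysis of transcriptome and epigenome data showed that different epigenetic regulations shape the strong spatial-temporal specificity of X-linked gene expression and ultimately affect the X:AA ratio.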
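As an illustration of the quantity discussed above, an X:AA ratio can be sketched as the median expression of X-linked genes divided by the median expression of autosomal genes in one sample. This is only one common convention and is an assumption here: the abstract does not state the exact estimator, the expression threshold, or the gene filters used in the study, and all names and values below are hypothetical.

```python
from statistics import median


def x_to_aa_ratio(expression, chromosome, min_expr=1.0):
    """Median X-linked expression divided by median autosomal expression.

    expression: dict mapping gene -> expression value in one sample
    chromosome: dict mapping gene -> chromosome name ("X", "Y", "1", ...)
    min_expr:   genes below this value are ignored (assumed filter)
    """
    x_values, autosomal_values = [], []
    for gene, value in expression.items():
        if value < min_expr:
            continue  # skip lowly expressed genes
        chrom = chromosome[gene]
        if chrom == "X":
            x_values.append(value)
        elif chrom != "Y":
            autosomal_values.append(value)
    if not x_values or not autosomal_values:
        raise ValueError("need expressed genes on both X and autosomes")
    return median(x_values) / median(autosomal_values)


if __name__ == "__main__":
    # Hypothetical toy data: two X-linked genes and four autosomal genes.
    expr = {"gx1": 4.0, "gx2": 6.0, "ga1": 5.0, "ga2": 5.0, "ga3": 10.0, "ga4": 0.1}
    chrom = {"gx1": "X", "gx2": "X", "ga1": "1", "ga2": "2", "ga3": "3", "ga4": "4"}
    print(x_to_aa_ratio(expr, chrom))
```

A ratio near 1 would be read as balanced expression between the X chromosome and autosomes, and a ratio near 0.5 as uncompensated expression.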

Page 589-600

Original Research

stAPAminer: Mining Spatial Patterns of Alternative Polyadenylation for Spatially Resolved Transcriptomic Studies

Guoli Ji, Qi Tang, Sheng Zhu, Junyi Zhu, Pengchao Ye, Shuting Xia, Xiaohui Wu

Alternative polyadenylation (APA) contributes to transcriptome complexity and gene expression regulation and has been implicated in various cellular processes and diseases. Single-cell RNA sequencing (scRNA-seq) has enabled the profiling of APA at the single-cell level; however, the spatial information of cells is not preserved in scRNA-seq. Alternatively, spatial transcriptomics (ST) technologies provide opportunities to decipher the spatial context of the transcriptomic landscape. Pioneering studies have revealed potential spatially variable genes and/or splice isoforms; however, the pattern of APA usage in spatial contexts remains unappreciated. In this study, we developed a toolkit called stAPAminer for mining spatial patterns of APA from spatially barcoded ST data. APA sites were identified and quantified from the ST data. In particular, an imputation model based on the k-nearest neighbors algorithm was designed to recover APA signals, and then APA genes with spatial patterns of APA usage variation were identified. By analyzing well-established ST data of the mouse olfactory bulb (MOB), we presented a detailed view of spatial APA usage across morphological layers of the MOB. We compiled a comprehensive list of genes with spatial APA dynamics and obtained several major spatial expression patterns that represent spatial APA dynamics in different morphological layers. By extending this analysis to two additional replicates of the MOB ST data, we observed that the spatial APA patterns of several genes were reproducible among replicates. stAPAminer employs the power of ST to explore the transcriptional atlas of spatial APA patterns with spatial resolution. This toolkit is available at https://github.com/BMILAB/stAPAminer and https://ngdc.cncb.ac.cn/biocode/tools/BT007320.
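The abstract above mentions an imputation model based on the k-nearest neighbors algorithm without giving its details. The sketch below shows only the generic idea, under the assumption that a missing value at a spot is filled with the mean of the k spatially closest spots that have an observed value. It is not the stAPAminer implementation; the function name, the use of spatial distance, and the toy data are all hypothetical.

```python
import math


def knn_impute(coords, values, k=3):
    """Fill missing values (None) with the mean of the k nearest observed spots.

    coords: list of (x, y) spot coordinates
    values: list of floats or None, one per spot (e.g., an APA usage index)
    """
    observed = [i for i, v in enumerate(values) if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    imputed = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue  # keep observed values untouched
        # Rank observed spots by Euclidean distance to the spot being imputed.
        nearest = sorted(observed, key=lambda j: math.dist(coords[i], coords[j]))[:k]
        imputed[i] = sum(values[j] for j in nearest) / len(nearest)
    return imputed


if __name__ == "__main__":
    spots = [(0, 0), (1, 0), (0, 1), (5, 5)]
    usage = [None, 0.2, 0.4, 0.9]
    print(knn_impute(spots, usage, k=2))
```

Only originally observed values serve as neighbors, so imputed values never feed into other imputations.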

Page 601-618

Original Research

The Integrative Studies on the Functional A-to-I RNA Editing Events in Human Cancers

Sijia Wu, Zhiwei Fan, Pora Kim, Liyu Huang, Xiaobo Zhou

Adenosine-to-inosine (A-to-I) RNA editing, constituting nearly 90% of all RNA editing events in humans, has been reported to contribute to tumorigenesis in diverse cancers. However, a comprehensive map of functional A-to-I RNA editing events in cancers is still lacking. To fill this gap, we systematically and intensively analyzed multiple tumorigenic mechanisms of A-to-I RNA editing events in samples across 33 cancer types from The Cancer Genome Atlas. For each candidate among ∼ 1,500,000 quantified RNA editing events, we performed diverse types of downstream functional annotations. Finally, we identified 24,236 potentially functional A-to-I RNA editing events, including cases in APOL1, IGFBP3, GRIA2, BLCAP, and miR-589-3p. These events might play crucial roles in tumorigenesis, owing to their tumor-related editing frequencies or their probable effects on the expression profiles, protein functions, splicing patterns, and microRNA regulation of tumor genes. Our catalog of functional A-to-I RNA editing events (https://ccsm.uth.edu/CAeditome/) will help to better understand cancer pathology from the perspective of A-to-I RNA editing.

Page 619-631

Original Research

CNEReg Interprets Ruminant-specific Conserved Non-coding Elements by Developmental Gene Regulatory Network

Xiangyu Pan, Zhaoxia Ma, Xinqi Sun, Hui Li, Tingting Zhang, Chen Zhao, Nini Wang, Rasmus Heller, Wing Hung Wong, Wen Wang, Yu Jiang, Yong Wang

The genetic information coded in DNA leads to trait innovation via a gene regulatory network (GRN) in development. Here, we developed a conserved non-coding element interpretation method to integrate multi-omics data into gene regulatory network (CNEReg) to investigate the ruminant multi-chambered stomach innovation. We generated paired expression and chromatin accessibility data during rumen and esophagus development in sheep, and revealed 1601 active ruminant-specific conserved non-coding elements (active-RSCNEs). To interpret the function of these active-RSCNEs, we defined toolkit transcription factors (TTFs) and modeled their regulation on rumen-specific genes via batteries of active-RSCNEs during development. Our developmental GRN revealed 18 TTFs and 313 active-RSCNEs regulating 7 rumen functional modules. Notably, 6 TTFs (OTX1, SOX21, HOXC8, SOX2, TP63, and PPARG), as well as 16 active-RSCNEs, functionally distinguished the rumen from the esophagus. Our study provides a systematic approach to understanding how gene regulation evolves and shapes complex traits by putting evo-devo concepts into practice with developmental multi-omics data.

Page 632-648

Original Research

Performance Comparison of Computational Methods for the Prediction of the Function and Pathogenicity of Non-coding Variants

Zheng Wang, Guihu Zhao, Bin Li, Zhenghuan Fang, Qian Chen, Xiaomeng Wang, Tengfei Luo, Yijing Wang, Qiao Zhou, Kuokuo Li, Lu Xia, Yi Zhang, Xun Zhou, Hongxu Pan, Yuwen Zhao, Yige Wang, Lin Wang, Jifeng Guo, Beisha Tang, Kun Xia, Jinchen Li

Non-coding variants in the human genome significantly influence human traits and complex diseases via their regulation and modification effects. Hence, an increasing number of computational methods have been developed to predict the effects of variants in human non-coding sequences. However, it is difficult for inexperienced users to select appropriate computational methods from the dozens available. To solve this issue, we assessed 12 performance metrics of 24 methods on four independent non-coding variant benchmark datasets: (1) rare germline variants from clinically relevant sequence variants (ClinVar), (2) rare somatic variants from the Catalogue Of Somatic Mutations In Cancer (COSMIC), (3) common regulatory variants from curated expression quantitative trait locus (eQTL) data, and (4) disease-associated common variants from curated genome-wide association studies (GWAS). All 24 tested methods performed differently under various conditions, indicating varying strengths and weaknesses under different scenarios. Importantly, the performance of existing methods was acceptable for rare germline variants from ClinVar, with the area under the receiver operating characteristic curve (AUROC) of 0.4481–0.8033, and poor for rare somatic variants from COSMIC (AUROC = 0.4984–0.7131), common regulatory variants from curated eQTL data (AUROC = 0.4837–0.6472), and disease-associated common variants from curated GWAS (AUROC = 0.4766–0.5188). We also compared the prediction performance of the 24 methods for non-coding de novo mutations in autism spectrum disorder, and found that the combined annotation-dependent depletion (CADD) and context-dependent tolerance score (CDTS) methods showed better performance. In summary, we assessed the performance of 24 computational methods under diverse scenarios, providing preliminary advice for proper tool selection and guiding the development of new techniques for interpreting non-coding variants.
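The AUROC values reported above can be read as the probability that a randomly chosen positive variant receives a higher score than a randomly chosen negative one, with ties counted as one half, so 0.5 corresponds to random ranking. The following is a minimal sketch of that pairwise definition; the function name and the toy labels and scores are illustrative and do not come from the benchmark.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: iterable of 1 (positive, e.g., functional variant) or 0 (negative)
    scores: iterable of prediction scores, higher meaning more likely positive
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Three of the four positive-negative pairs are ranked correctly.
    print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise form is quadratic in the number of variants, which is acceptable for illustration; rank-based formulations give the same value more efficiently on large benchmarks.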

Page 649-661


Computational Assessment of the Expression-modulating Potential for Non-coding Variants

Fang-Yuan Shi, Yu Wang, Dong Huang, Yu Liang, Nan Liang, Xiao-Wei Chen, Ge Gao

Large-scale genome-wide association studies (GWAS) and expression quantitative trait locus (eQTL) studies have identified multiple non-coding variants associated with genetic diseases by affecting gene expression. However, pinpointing causal variants effectively and efficiently remains a serious challenge. Here, we developed CARMEN, a novel algorithm to identify functional non-coding expression-modulating variants. Multiple evaluations demonstrated CARMEN’s superior performance over state-of-the-art tools. Applying CARMEN to GWAS and eQTL datasets further pinpointed several causal variants other than the reported lead single-nucleotide polymorphisms (SNPs). CARMEN scales well with the massive datasets, and is available online as a web server at http://carmen.gao-lab.org.
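Research question: Revealing the functions of non-coding variants in the human genome is essential for understanding differences in individual phenotypes and disease susceptibility. Because non-coding variants are numerous and their functional mechanisms are complex, accurately predicting non-coding variants that regulate gene expression is an important and highly challenging task.
Background: About 98% of the human genome is non-coding. Recent studies have shown that more than 90% of the variants associated with specific physiological or pathological traits are non-coding variants. By integrating gene expression regulation information and related pathogenicity information for non-coding variants, and by using multiple statistical learning models, this study developed the CARMEN algorithm to predict functional, expression-modulating non-coding variants. Compared with existing methods, its prediction accuracy is significantly improved, providing a new tool for follow-up experimental work.
Methods: Based on ENCODE (Encyclopedia of DNA Elements) data, DNA physicochemical properties, and genome conservation data, the authors used convolutional neural networks to construct 2424 annotation features. They then built two training datasets: an MPRA (massively parallel reporter assay) dataset and a dataset of regulatory variants labeled as pathogenic in HGMD (Human Gene Mutation Database). Machine learning models were trained on these two datasets and integrated to obtain the CARMEN score, which is used to predict non-coding variants that regulate gene expression. The authors further applied the CARMEN model to disease-associated variants and validated the functional non-coding variants prioritized by the model using dual-luciferase reporter assays.
Main result 1: CARMEN predicts expression-modulating non-coding variants based on multi-dimensional functional annotation of variants, data-driven feature selection, and the integration of multiple model outputs. Multiple evaluations on independent test sets showed that CARMEN significantly improved sensitivity and accuracy compared with existing mainstream prediction methods.
Main result 2: Analysis of published GWAS (genome-wide association studies) variants showed that nearly half of the tag single-nucleotide polymorphisms (tag SNPs) associated with specific traits had lower CARMEN scores than variants in linkage with them, and at 6.65% of loci the score difference exceeded 30-fold. This suggests that the tag SNPs reported by GWAS may not be the causal variants (causal SNPs) that truly exert the regulatory effect.
Main result 3: CARMEN was used to predict expression-modulating non-coding variants associated with type 1 and type 2 diabetes. The predictions were validated experimentally, and their possible mechanisms of action were suggested.
Tool link: http://carmen.gao-lab.org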

Page 662-673