Volume: 20, Issue: 1


En Route to Completion: What Is An Ideal Reference Genome?

Weihua Pan, Jue Ruan

Page 1-3

Research Article

High-quality Arabidopsis thaliana Genome Assembly with Nanopore and HiFi Long Reads

Bo Wang, Xiaofei Yang, Yanyan Jia, Yu Xu, Peng Jia, Ningxin Dang, Songbo Wang, Tun Xu, Xixi Zhao, Shenghan Gao, Quanbin Dong, Kai Ye

Arabidopsis thaliana is an important and long-established model species for plant molecular biology, genetics, epigenetics, and genomics. However, the latest version of reference genome still contains a significant number of missing segments. Here, we reported a high-quality and almost complete Col-0 genome assembly with two gaps (named Col-XJTU) by combining the Oxford Nanopore Technologies ultra-long reads, Pacific Biosciences high-fidelity long reads, and Hi-C data. The total genome assembly size is 133,725,193 bp, introducing 14.6 Mb of novel sequences compared to the TAIR10.1 reference genome. All five chromosomes of the Col-XJTU assembly are highly accurate with consensus quality (QV) scores > 60 (ranging from 62 to 68), which are higher than those of the TAIR10.1 reference (ranging from 45 to 52). We completely resolved chromosome (Chr) 3 and Chr5 in a telomere-to-telomere manner. Chr4 was completely resolved except the nucleolar organizing regions, which comprise long repetitive DNA fragments. The Chr1 centromere (CEN1), reportedly around 9 Mb in length, is particularly challenging to assemble due to the presence of tens of thousands of CEN180 satellite repeats. Using the cutting-edge sequencing data and novel computational approaches, we assembled a 3.8-Mb-long CEN1 and a 3.5-Mb-long CEN2. We also investigated the structure and epigenetics of centromeres. Four clusters of CEN180 monomers were detected, and the centromere-specific histone H3-like protein (CENH3) exhibited a strong preference for CEN180 Cluster 3. Moreover, we observed hypomethylation patterns in CENH3-enriched regions. We believe that this high-quality genome assembly, Col-XJTU, would serve as a valuable reference to better understand the global pattern of centromeric polymorphisms, as well as the genetic and epigenetic features in plants.
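Research question: How can a high-quality Arabidopsis thaliana genome be assembled and applied, and what is the sequence structure of its centromeres? Methods: To address the difficulty of assembling the ultra-long repeats in centromeric regions, we proposed a hybrid assembly-and-replacement strategy anchored on bacterial artificial chromosome (BAC) sequences. We combined high-depth, high-accuracy HiFi (157×) and ONT ultra-long (177×) sequencing with Hi-C chromosome conformation capture, and used several assembly tools together (hifiasm, NextDenovo, 3D-DNA, and Juicer) to produce a high-quality A. thaliana assembly with only two remaining gaps. Merqury, bacValidation, TandemTools, and BUSCO were used to assess the sequence accuracy and structural correctness of Col-XJTU. LASTZ, Clustal Omega, and the R package ‘phyclust’ were used for systematic clustering and variation analysis of the centromeric 5S rDNA and the highly tandemly repeated CEN180 sequences. MACS2 and Nanopolish were used for epigenetic analysis of the centromeres, covering enrichment of the centromere-specific histone CENH3 and DNA methylation, respectively. Key result 1: We completed a high-quality A. thaliana assembly containing only two gaps, in which Chr3 and Chr5 are telomere-to-telomere (T2T) assemblies and the centromeric region of Chr4 (CEN4) is gap-free. Key result 2: The number of 5S rDNA monomers annotated in the centromeric regions is twice that previously reported, and the monomers fall into four clusters. All four clusters show high GC content and high DNA methylation. Key result 3: The more than 60,000 CEN180 tandem repeats fall into four clusters, of which Cluster 3 shows significant CENH3 signal on all five centromeres. Key result 4: DNA methylation is higher in the centromeres than in the pericentromeric regions, but within the centromeres the CENH3-enriched CEN180 tandem repeat regions are hypomethylated.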
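The consensus quality (QV) scores quoted for Col-XJTU and TAIR10.1 are Phred-scaled, so they can be converted to an expected per-base error rate. The following is a minimal sketch of that conversion; the function names are illustrative and are not taken from Merqury or any other tool named above.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    return -10 * math.log10(error_rate)


# QV ranges quoted in the abstract for the two assemblies.
for label, qv in [("TAIR10.1 lowest", 45), ("TAIR10.1 highest", 52),
                  ("Col-XJTU lowest", 62), ("Col-XJTU highest", 68)]:
    rate = qv_to_error_rate(qv)
    print(f"{label}: QV {qv} is about one error per {1 / rate:,.0f} bases")
```

On this scale, every 10 QV points correspond to a tenfold drop in the expected error rate, so QV 60 corresponds to roughly one error per million bases.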

Page 4-13

Research Article

Genome Assembly of Alfalfa Cultivar Zhongmu-4 and Identification of SNPs Associated with Agronomic Traits

Ruicai Long, Fan Zhang, Zhiwu Zhang, Mingna Li, Lin Chen, Xue Wang, Wenwen Liu, Tiejun Zhang, Long-Xi Yu, Fei He, Xueqian Jiang, Xijiang Yang, Changfu Yang, Zhen Wang, Junmei Kang, Qingchuan Yang

Alfalfa (Medicago sativa L.) is the most important legume forage crop worldwide, with high nutritional value and yield. For a long time, the breeding of alfalfa was hampered by the lack of reliable information on the autotetraploid genome and of molecular markers linked to important agronomic traits. Here, we report the de novo assembly of the allele-aware chromosome-level genome of Zhongmu-4, a cultivar widely cultivated in China, and a comprehensive database of genomic variations based on resequencing of 220 germplasms. Approximately 2.74 Gb of contigs (N50 of 2.06 Mb), accounting for 88.39% of the estimated genome, were assembled, and 2.56 Gb of contigs were anchored to 32 pseudo-chromosomes. A total of 34,922 allelic genes were identified from the allele-aware genome. We observed the expansion of gene families, especially those related to nitrogen metabolism, and an increase in repetitive elements including transposable elements, which probably account for the larger genome of Zhongmu-4 compared with Medicago truncatula. Population structure analysis revealed that the accessions from Asia and South America had relatively lower genetic diversity than those from Europe, suggesting that geography may influence alfalfa genetic divergence during local adaptation. Genome-wide association studies identified 101 single nucleotide polymorphisms (SNPs) associated with 27 agronomic traits. Two candidate genes were predicted to be correlated with fall dormancy and salt response. We believe that the allele-aware chromosome-level genome sequence of Zhongmu-4, combined with the resequencing data of the diverse alfalfa germplasms, will facilitate genetic research and genomics-assisted breeding for variety improvement of alfalfa.
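Research question: Alfalfa is the most important forage crop in the world. Because it is autotetraploid and has a relatively large genome, sequencing and assembling the alfalfa genome has long been difficult. Without a high-quality genome, resequencing of core germplasm and genome-wide association analysis of important agronomic traits have progressed slowly, which in turn has constrained genetic improvement and the breeding of new alfalfa varieties. Methods: Using Zhongmu-4, a high-yield, high-quality alfalfa cultivar bred in China, as the material, contigs were assembled from third- and second-generation sequencing data, and homologous chromosomes were then scaffolded with Hi-C data and the polyploid assembly software ALLHiC. Protein-coding genes, alleles, and transposon repeats in the resulting allele-aware chromosome-level genome were annotated. Several closely related legume species were selected for evolutionary analysis and for analysis of gene family contraction and expansion. A total of 220 domestic and international core germplasm accessions were resequenced, and high-confidence SNP sites were selected using the Zhongmu-4 genome as the reference. Ninety-three agronomic trait measurements were collected for the 220 accessions, and PCA and GWAS were performed with GAPIT3. Key result 1: The assembly yielded 2.74 Gb of contig sequence (N50: 2.06 Mb; BUSCO: 98.4%; LAI: 13.85), of which 2.56 Gb was anchored to 32 chromosomes using Hi-C data. Genome annotation identified 146,704 protein-coding genes, and 34,922 allelic genes were identified among the homologous chromosomes. Key result 2: Comparison with the model plant Medicago truncatula showed marked expansion of gene families in pathways such as nitrogen metabolism in alfalfa, and the proportion of repetitive elements such as transposons in the genome also increased significantly. Key result 3: Using the allele-aware chromosome-level Zhongmu-4 genome as the reference, together with resequencing data from the 220 core germplasm accessions, 111,075 high-confidence SNP markers were obtained. Genome-wide association analysis identified 101 SNP markers significantly associated with 27 important agronomic traits, including fall dormancy, salt tolerance, and nutritional quality. Data links: https://ngdc.cncb.ac.cn/search/?dbId=gwh&q=GWHBECI00000000 https://bigd.big.ac.cn/gsa/browse/CRA005190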
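The contig N50 reported for the Zhongmu-4 assembly is a standard contiguity statistic: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of the calculation follows; the toy contig lengths are invented for illustration and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Toy example: total length 20, so N50 is reached once 10 bases are covered.
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

In practice the lengths would be read from the assembly FASTA (or its index) rather than typed in by hand.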

Page 14-28

Research Article

Resequencing 250 Soybean Accessions: New Insights into Genes Associated with Agronomic Traits and Genetic Networks

Chunming Yang, Jun Yan, Shuqin Jiang, Xia Li, Haowei Min, Xiangfeng Wang, Dongyun Hao

The limited knowledge of genomic diversity and functional genes associated with the traits of soybean varieties has resulted in slow progress in breeding. In this study, we sequenced the genomes of 250 soybean landraces and cultivars from China, America, and Europe, and investigated their population structure, genetic diversity and architecture, and the selective sweep regions of these accessions. Five novel agronomically important genes were identified, and the effects of functional mutations in respective genes were examined. The candidate genes GSTT1, GL3, and GSTL3 associated with the isoflavone content, CKX3 associated with yield traits, and CYP85A2 associated with both architecture and yield traits were found. The phenotype–gene network analysis revealed that hub nodes play a crucial role in complex phenotypic associations. This study describes novel agronomic trait-associated genes and a complex genetic network, providing a valuable resource for future soybean molecular breeding.
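To accelerate soybean molecular breeding, the genetic diversity of soybean and the genes associated with important agronomic traits need to be studied in depth. This project collected 250 soybean mini-core germplasm accessions from China, the United States, Europe, and elsewhere, including major bred cultivars and some representative landraces. First, these accessions were subjected to whole-genome resequencing, and analyses of population structure, genetic diversity, and selective sweeps were used to identify genomic regions potentially under selection during soybean improvement. Then, genome-wide association analysis of 50 agronomic traits identified five new candidate genes: GSTT1, GL3, and GSTL3, associated with isoflavone content; CKX3, associated with yield traits; and CYP85A2, associated with plant architecture and yield traits. Finally, a genetic regulatory network containing all the agronomic traits and their associated genes was constructed, and the key role of hub nodes in determining complex traits was analyzed. In summary, the candidate genes identified and the genetic network constructed in this study provide valuable information and resources for soybean molecular breeding.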

Page 29-41

Research Article

A Chromosome-level Genome Assembly of Wild Castor Provides New Insights into its Adaptive Evolution in Tropical Desert

Jianjun Lu, Cheng Pan, Wei Fan, Wanfei Liu, Huayan Zhao, Donghai Li, Sen Wang, Lianlian Hu, Bing He, Kun Qian, Rui Qin, Jue Ruan, Qiang Lin, Shiyou Lü, Peng Cui

Wild castor grows in the high-altitude tropical desert of the African Plateau, a region known for high ultraviolet radiation, strong light, and extremely dry conditions. To investigate the potential genetic basis of adaptation to both highland and tropical deserts, we generated a chromosome-level genome sequence assembly of the wild castor accession WT05, with a genome size of 316 Mb, a scaffold N50 of 31.93 Mb, and a contig N50 of 8.96 Mb. Compared with cultivated castor and other Euphorbiaceae species, the wild castor exhibits positive selection and gene family expansion for genes involved in DNA repair, photosynthesis, and abiotic stress responses. Genetic variations associated with positive selection were identified in several key genes, such as LIG1, DDB2, and RECG1, involved in nucleotide excision repair. Moreover, a study of genomic diversity among wild and cultivated accessions revealed genomic regions containing selection signatures associated with the adaptation to extreme environments. The identification of the genes and alleles with selection signatures provides insights into the genetic mechanisms underlying the adaptation of wild castor to the high-altitude tropical desert and would facilitate direct improvement of modern castor varieties.

Page 42-59

Research Article

Genomic Perspectives on the Emerging SARS-CoV-2 Omicron Variant

Wentai Ma, Jing Yang, Haoyi Fu, Chao Su, Caixia Yu, Qihui Wang, Ana Tereza Ribeiro de Vasconcelos, Georgii A. Bazykin, Yiming Bao, Mingkun Li

A new variant of concern for SARS-CoV-2, Omicron (B.1.1.529), was designated by the World Health Organization on November 26, 2021. This study analyzed the viral genome sequencing data of 108 samples collected from patients infected with Omicron. First, we found that the enrichment efficiency of viral nucleic acids was reduced due to mutations in the regions where the primers anneal. Second, the Omicron variant possesses an excessive number of mutations compared to other variants circulating at the same time (median: 62 vs. 45), especially in the Spike gene. Mutations in the Spike gene confer alterations in 32 amino acid residues, more than those observed in other SARS-CoV-2 variants. Moreover, a large number of nonsynonymous mutations occur in the codons for the amino acid residues located on the surface of the Spike protein, which could potentially affect the replication, infectivity, and antigenicity of SARS-CoV-2. Third, there are 53 mutations between the Omicron variant and its closest sequences available in public databases. Many of these mutations were rarely observed in public databases and had a low mutation rate. In addition, the linkage disequilibrium between these mutations was low, with a limited number of mutations concurrently observed in the same genome, suggesting that the Omicron variant lies on a different evolutionary branch from the currently prevalent variants. To improve our ability to detect and track the source of new variants rapidly, it is imperative to further strengthen genomic surveillance and data sharing globally in a timely manner.
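Research question: Omicron is a recently discovered variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the fifth variant designated a “variant of concern” (VOC) by the World Health Organization. Since it was first detected in Africa in November 2021, it has attracted wide attention because of the large number of mutations in its Spike protein and its rapid spread. What characterizes the mutations in the Omicron genome? Where did the variant come from? Will it affect the ability of the virus to escape vaccines and antibodies? These are the questions of general concern. Methods: This study analyzed viral genome sequencing data from 108 samples collected from Omicron-infected patients in South Africa. Key result 1: The PCR amplification scheme commonly used to enrich SARS-CoV-2 has reduced enrichment efficiency for some regions of the Omicron variant. Key result 2: The number of genomic mutations in the Omicron variant is markedly higher than in other variants circulating at the same time, and the mutations are concentrated in the Spike gene. Key result 3: The mutations of the Omicron variant may affect neutralization by some antibodies. Key result 4: It is unlikely that the Omicron variant arose by mutation or recombination from other recently identified variants. Data link: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA784038/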
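The Omicron abstract reports low linkage disequilibrium between mutations without naming the statistic used. One common measure is r², sketched below for two biallelic sites across a set of haploid genomes. The function name and the 0/1 encoding (1 means the genome carries the mutation) are assumptions made for this illustration, not details from the study.

```python
def ld_r2(site_a, site_b):
    """Squared correlation (r^2) between two biallelic sites.

    Each argument is a sequence of 0/1 flags, one per genome, where 1 means
    the genome carries the mutation at that site.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a and b) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    d = p_ab - p_a * p_b  # classical disequilibrium coefficient D
    return d * d / denom


# Two mutations that always co-occur versus two that occur independently.
print(ld_r2([1, 1, 0, 0], [1, 1, 0, 0]))
print(ld_r2([1, 1, 0, 0], [1, 0, 1, 0]))
```

Values near 1 indicate that two mutations are almost always inherited together; values near 0 indicate that they rarely share a genomic background.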

Page 60-69

Research Article

Single-cell Transcriptomic Analysis Reveals the Cellular Heterogeneity of Mesenchymal Stem Cells

Chen Zhang, Xueshuai Han, Jingkun Liu, Lei Chen, Ying Lei, Kunying Chen, Jia Si, Tian-yi Wang, Hui Zhou, Xiaoyun Zhao, Xiaohui Zhang, Yihua An, Yueying Li, Qian-Fei Wang

Ex vivo-expanded mesenchymal stem cells (MSCs) have been demonstrated to be a heterogeneous mixture of cells exhibiting varying proliferative, multipotential, and immunomodulatory capacities. However, the exact characteristics of MSCs remain largely unknown. By single-cell RNA sequencing of 61,296 MSCs derived from bone marrow and Wharton’s jelly, we revealed five distinct subpopulations. The developmental trajectory of these five MSC subpopulations was mapped, revealing a differentiation path from stem-like active proliferative cells (APCs) to multipotent progenitor cells, followed by branching into two paths: 1) unipotent preadipocytes or 2) bipotent prechondro-osteoblasts that were subsequently differentiated into unipotent prechondrocytes. The stem-like APCs, expressing the perivascular mesodermal progenitor markers CSPG4/MCAM/NES, uniquely exhibited strong proliferation and stemness signatures. Remarkably, the prechondrocyte subpopulation specifically expressed immunomodulatory genes and was able to suppress activated CD3+ T cell proliferation in vitro, supporting the role of this population in immunoregulation. In summary, our analysis mapped the heterogeneous subpopulations of MSCs and identified two subpopulations with potential functions in self-renewal and immunoregulation. Our findings advance the definition of MSCs by identifying the specific functions of their heterogeneous cellular composition, allowing for more specific and effective MSC application through the purification of their functional subpopulations.
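Research question: Ex vivo-expanded mesenchymal stem cells (MSCs) are an important class of multipotent stem cells with functions including self-renewal, multipotent differentiation, and secretion of inflammatory factors, and are among the most widely used stem cell products in the clinic. However, MSCs are highly heterogeneous, which limits their application and therapeutic efficacy. Previous studies have not accurately identified the heterogeneous subpopulations of ex vivo MSCs or their characteristics, a scientific question of broad interest in the field. Methods: To study the cellular heterogeneity of ex vivo MSCs, we collected a total of 61,296 MSCs from two sources: adult bone marrow (bone marrow-derived MSCs, BMMSCs) and neonatal umbilical cord Wharton’s jelly (Wharton’s jelly-derived MSCs, WJMSCs). Using single-cell transcriptomics, we comprehensively profiled the cellular composition and functional subpopulations of MSCs from the two sources, and combined this with functional experiments to characterize the specific markers and distinct functions of those subpopulations. Key results: 1. BMMSCs and WJMSCs contain five similar major subpopulations: active proliferative cells, multipotent progenitor cells, unipotent preadipocytes, bipotent prechondro-osteoblasts, and prechondrocytes; 2. The active proliferative subpopulation (subpopulation 1) has a strong stemness transcriptional signature; 3. The multipotent mesenchymal progenitor cell (MPC) subpopulation (subpopulation 2) has differentiation potential toward all three lineages: osteogenic, adipogenic, and chondrogenic; 4. The prechondrocyte subpopulation (subpopulation 5) has a specific immunomodulatory function. In summary, our study gives a panoramic description of the cellular and functional heterogeneity of MSCs and reveals subpopulations with distinct functions in self-renewal, multipotent differentiation, and immunoregulation, together with their characteristics. Identifying these functional subpopulations can advance an accurate definition of MSCs and promote precise treatment in their clinical application. Data link: https://ngdc.cncb.ac.cn/gsa-human/browse/HRA000220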

Page 70-86

Research Article

Defining Proximity Proteome of Histone Modifications by Antibody-mediated Protein A-APEX2 Labeling

Xinran Li, Jiaqi Zhou, Wenjuan Zhao, Qing Wen, Weijie Wang, Huipai Peng, Yuan Gao, Kelly J. Bouchonville, Steven M. Offer, Kuiming Chan, Zhiquan Wang, Nan Li, Haiyun Gan

Proximity labeling catalyzed by promiscuous enzymes, such as APEX2, has emerged as a powerful approach to characterize multiprotein complexes and protein–protein interactions. However, current methods depend on the expression of exogenous fusion proteins and cannot be applied to identify proteins surrounding post-translationally modified proteins. To address this limitation, we developed a new method to label proximal proteins of interest by antibody-mediated protein A-ascorbate peroxidase 2 (pA-APEX2) labeling (AMAPEX). In this method, a modified protein is bound in situ by a specific antibody, which then tethers a pA-APEX2 fusion protein. Activation of APEX2 labels the nearby proteins with biotin; the biotinylated proteins are then purified using streptavidin beads and identified by mass spectrometry. We demonstrated the utility of this approach by profiling the proximal proteins of histone modifications including H3K27me3, H3K9me3, H3K4me3, H4K5ac, and H4K12ac, as well as verifying the co-localization of these identified proteins with bait proteins by published ChIP-seq analysis and nucleosome immunoprecipitation. Overall, AMAPEX is an efficient method to identify proteins that are proximal to modified histones.

Page 87-100

Research Article

Epithelial Cells in 2D and 3D Cultures Exhibit Large Differences in Higher-order Genomic Interactions

Xin Liu, Qiu Sun, Qi Wang, Chuansheng Hu, Xuecheng Chen, Hua Li, Daniel M. Czajkowsky, Zhifeng Shao

Recent studies have characterized the genomic structures of many eukaryotic cells, often focusing on their relation to gene expression. However, these studies have largely investigated cells grown in 2D cultures, although the transcriptomes of 3D-cultured cells are generally closer to their in vivo phenotypes. To examine the effects of spatial constraints on chromosome conformation, we investigated the genomic architecture of mouse hepatocytes grown in 2D and 3D cultures using in situ Hi-C. Our results reveal significant differences in higher-order genomic interactions, notably in compartment identity and strength as well as in topologically associating domain (TAD)–TAD interactions, but only minor differences are found at the TAD level. Our RNA-seq analysis reveals an up-regulated expression of genes involved in physiological hepatocyte functions in the 3D-cultured cells. These genes are associated with a subset of structural changes, suggesting that differences in genomic structure are critically important for transcriptional regulation. However, there are also many structural differences that are not directly associated with changes in gene expression, whose cause remains to be determined. Overall, our results indicate that growth in 3D significantly alters higher-order genomic interactions, which may be consequential for a subset of genes that are important for the physiological functioning of the cell.
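In recent years, much work has aimed to reveal the spatial structure of the genome in eukaryotic cells, especially the relationship between three-dimensional chromatin structure and gene expression. To date, although the transcriptomes of 3D-cultured cells are closer to their in vivo phenotypes, most studies of chromatin structure have still used cells grown in 2D culture. Therefore, to determine how growth conditions affect chromatin structure, we used in situ Hi-C to study the spatial chromatin structure of mouse hepatocytes under 2D and 3D culture conditions. The results show that culture conditions significantly affect higher-order chromatin structure. These effects appear mainly at higher-order scales, such as the identity of chromatin compartments and the strength of interactions between topologically associating domains (TADs), whereas the effect on the TADs themselves is not significant. Transcriptome analysis shows that in 3D-cultured cells the expression of genes related to physiological hepatocyte functions is markedly up-regulated, and that this up-regulation is associated with the nature of the chromatin compartments in which the genes reside. This result further supports the important role of chromatin structure in transcriptional regulation. However, we also found that many significant changes in chromatin structure are not directly correlated with changes in gene expression; their mechanism remains to be studied. Overall, these results clearly demonstrate a direct effect of 3D culture conditions on chromatin structure in the cell, and these structures may be critically important for maintaining the physiological functions of the cell.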

Page 101-109

Research Article

Npac Is A Co-factor of Histone H3K36me3 and Regulates Transcriptional Elongation in Mouse Embryonic Stem Cells

Chromatin modification contributes to pluripotency maintenance in embryonic stem cells (ESCs). However, the related mechanisms remain obscure. Here, we show that Npac, a “reader” of histone H3 lysine 36 trimethylation (H3K36me3), is required to maintain mouse ESC (mESC) pluripotency since knockdown of Npac causes mESC differentiation. Depletion of Npac in mouse embryonic fibroblasts (MEFs) inhibits reprogramming efficiency. Furthermore, our chromatin immunoprecipitation followed by sequencing (ChIP-seq) results of Npac reveal that Npac co-localizes with histone H3K36me3 in gene bodies of actively transcribed genes in mESCs. Interestingly, we find that Npac interacts with positive transcription elongation factor b (p-TEFb), Ser2-phosphorylated RNA Pol II (RNA Pol II Ser2P), and Ser5-phosphorylated RNA Pol II (RNA Pol II Ser5P). Furthermore, depletion of Npac disrupts transcriptional elongation of the pluripotency genes Nanog and Rif1. Taken together, we propose that Npac is essential for the transcriptional elongation of pluripotency genes by recruiting p-TEFb and interacting with RNA Pol II Ser2P and Ser5P.
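Research question: What roles does the Npac protein play in maintaining the pluripotency of mouse embryonic stem cells and in somatic cell reprogramming? Are the genome-wide localizations of Npac and histone H3K36me3 correlated? Is Npac a component of the transcriptional elongation complex? How does Npac regulate H3K36me3-associated transcriptional elongation? Methods: In this study, gene knockdown, quantitative PCR, somatic cell reprogramming, and other methods were used to show that Npac is required for maintaining mouse embryonic stem cell pluripotency and for somatic cell reprogramming. ChIP-seq was used to map and compare the genome-wide localization of Npac and histone H3K36me3, showing that the two overlap extensively and consistently. Co-immunoprecipitation revealed an interaction between Npac and the key transcriptional elongation factor p-TEFb. Transcriptional elongation assays showed that Npac regulates gene transcriptional elongation in mouse embryonic stem cells. Key result 1: Npac is required for maintaining mouse embryonic stem cell pluripotency and for somatic cell reprogramming. Key result 2: The genomic localizations of Npac and histone H3K36me3 in mouse embryonic stem cells overlap extensively and consistently. Key result 3: Npac is a component of the transcriptional elongation complex. Key result 4: Npac regulates gene transcriptional elongation in mouse embryonic stem cells.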

Page 110-128

Research Article

SLM2 Is A Novel Cardiac Splicing Factor Involved in Heart Failure due to Dilated Cardiomyopathy

Jes-Niels Boeckel, Maximilian Möbius-Winkler, Marion Müller, Sabine Rebs, Nicole Eger, Laura Schoppe, Rewati Tappu, Karoline E. Kokot, Jasmin M. Kneuer, Susanne Gaul, Diana M. Bordalo, Alan Lai, Jan Haas, Mahsa Ghanbari, Philipp Drewe-Boss, Martin Liss, Hugo A. Katus, Uwe Ohler, Michael Gotthardt, Ulrich Laufs, Katrin Streckfuss-Bömeke, Benjamin Meder

Alternative mRNA splicing is a fundamental process to increase the versatility of the genome. In humans, cardiac mRNA splicing is involved in the pathophysiology of heart failure. Mutations in the splicing factor RNA binding motif protein 20 (RBM20) cause severe forms of cardiomyopathy. To identify novel cardiomyopathy-associated splicing factors, RNA-seq and tissue-enrichment analyses were performed, which identified up-regulated expression of Sam68-Like mammalian protein 2 (SLM2) in the left ventricle of dilated cardiomyopathy (DCM) patients. In the human heart, SLM2 binds to important transcripts of sarcomere constituents, such as those encoding myosin light chain 2 (MYL2), troponin I3 (TNNI3), troponin T2 (TNNT2), tropomyosin 1/2 (TPM1/2), and titin (TTN). Mechanistically, SLM2 mediates intron retention, prevents exon exclusion, and thereby mediates alternative splicing of the mRNA regions encoding the variable proline-, glutamate-, valine-, and lysine-rich (PEVK) domain and another part of the I-band region of titin. In summary, SLM2 is a novel cardiac splicing regulator with essential functions for maintaining cardiomyocyte integrity by binding to and processing the mRNAs of essential cardiac constituents such as titin.

Page 129-146

Research Article

Convergent Usage of Amino Acids in Human Cancers as A Reversed Process of Tissue Development

Yikai Luo, Han Liang

Genome- and transcriptome-wide amino acid usage preference across different species is a well-studied phenomenon in molecular evolution, but its characteristics and implication in cancer evolution and therapy remain largely unexplored. Here, we analyzed large-scale transcriptome/proteome profiles, such as The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and found that compared to normal tissues, different cancer types showed a convergent pattern toward using biosynthetically low-cost amino acids. Such a pattern can be accurately captured by a single index based on the average biosynthetic energy cost of amino acids, termed energy cost per amino acid (ECPA). With this index, we further compared the trends of amino acid usage and the contributing genes in cancer and tissue development, and revealed their reversed patterns. Finally, focusing on the liver, a tissue with a dramatic increase in ECPA during development, we found that ECPA represents a powerful biomarker that could distinguish liver tumors from normal liver samples consistently across 11 independent patient cohorts and outperforms any index based on single genes. Our study reveals an important principle underlying cancer evolution and suggests the global amino acid usage as a system-level biomarker for cancer diagnosis.
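Research question: What convergent pattern of cellular amino acid usage arises during the development of cancers originating from different tissues? How does this pattern relate to amino acid usage during the development of the corresponding normal tissues? Which genes and signaling pathways mainly influence amino acid usage during carcinogenesis and tissue development? What does the principle of tumor evolution based on amino acid usage suggest for cancer diagnosis? Methods: In this study, the authors used large-scale transcriptome/proteome data from cancers and from normal tissue development, such as The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), to estimate and compare how tissue cells select and use the twenty amino acids in these two processes. The authors mainly used a single-value index termed ECPA (energy cost per amino acid), which combines two layers of information, protein sequences and the expression levels of the corresponding genes, to summarize the amino acid usage reflected by the gene expression profile of a sample. Based on the strongly reversed amino acid usage between carcinogenesis and tissue development observed in the liver, the authors analyzed multiple independent gene expression datasets and examined in detail the ability of the ECPA index to distinguish liver tumor tissue from paired adjacent normal tissue. Key result 1: In contrast to the highly tissue-specific amino acid usage of normal tissue cells, the cancers originating from those tissues show convergent amino acid usage that crosses tissue and organ boundaries. Key result 2: Regardless of the tissue of origin, the convergent amino acid usage in cancer tissues is directed toward a preference for amino acids with lower biosynthetic energy cost. Key result 3: During development, normal tissues show the opposite trend to carcinogenesis, tending to use more amino acids with higher biosynthetic energy cost. Key result 4: The ECPA index, which integrates protein amino acid sequences and gene expression profiles, is an effective biomarker with a clear biological meaning for distinguishing liver tumor tissue from normal liver tissue.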
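The ECPA index is described as an average biosynthetic energy cost per amino acid that combines protein sequences with gene expression levels. One plausible reading is an expression-weighted mean cost per residue, sketched below. The exact weighting used in the paper is not given in this listing, so the formula is an assumption, and the cost values are placeholders rather than published biosynthetic costs.

```python
def ecpa(proteins, expression, cost):
    """Expression-weighted mean energy cost per amino acid residue.

    proteins:   gene -> protein sequence (one-letter amino acid codes)
    expression: gene -> expression level (any non-negative unit)
    cost:       amino acid -> biosynthetic energy cost (placeholder values)
    """
    total_cost = 0.0
    total_residues = 0.0
    for gene, seq in proteins.items():
        level = expression.get(gene, 0.0)
        total_cost += level * sum(cost[aa] for aa in seq)
        total_residues += level * len(seq)
    if total_residues == 0:
        raise ValueError("no expressed residues to average over")
    return total_cost / total_residues


# Hypothetical two-gene sample; costs are illustrative numbers only.
placeholder_cost = {"A": 1.0, "W": 5.0}
toy_proteins = {"gene1": "AA", "gene2": "WW"}
toy_expression = {"gene1": 3.0, "gene2": 1.0}
print(ecpa(toy_proteins, toy_expression, placeholder_cost))
```

Under this reading, a sample whose highly expressed genes encode proteins rich in cheap amino acids has a lower ECPA, which matches the direction of the cancer trend described above.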

Page 147-162

Research Article

Integrative Proteomic Analysis of Multiple Posttranslational Modifications in Inflammatory Response

Feiyang Ji, Menghao Zhou, Huihui Zhu, Zhengyi Jiang, Qirui Li, Xiaoxi Ouyang, Yiming Lv, Sainan Zhang, Tian Wu, Lanjuan Li

Posttranslational modifications (PTMs) of proteins, particularly acetylation, phosphorylation, and ubiquitination, play critical roles in the host innate immune response. PTMs’ dynamic changes and the crosstalk among them are complicated. To build a comprehensive dynamic network of inflammation-related proteins, we integrated data from the whole-cell proteome (WCP), acetylome, phosphoproteome, and ubiquitinome of human and mouse macrophages. Our datasets of acetylation, phosphorylation, and ubiquitination sites helped identify PTM crosstalk within and across proteins involved in the inflammatory response. Stimulation of macrophages by lipopolysaccharide (LPS) resulted in both degradative and non-degradative ubiquitination. Moreover, this study contributes to the interpretation of the roles of known inflammatory molecules and the discovery of novel inflammatory proteins.

Page 163-176

Research Article

Common Postzygotic Mutational Signatures in Healthy Adult Tissues Related to Embryonic Hypoxia

Yaqiang Hong, Dake Zhang, Xiangtian Zhou, Aili Chen, Amir Abliz, Jian Bai, Liang Wang, Qingtao Hu, Kenan Gong, Xiaonan Guan, Mengfei Liu, Xinchang Zheng, Shujuan Lai, Hongzhu Qu, Fuxin Zhao, Shuang Hao, Zhen Wu, Hong Cai, Shaoyan Hu, Yue Ma, Junting Zhang, Yang Ke, Qian-Fei Wang, Wei Chen, Changqing Zeng

Postzygotic mutations are acquired in normal tissues throughout an individual’s lifetime and hold clues for identifying mutagenic factors. Here, we investigated postzygotic mutation spectra of healthy individuals using optimized ultra-deep exome sequencing of the time-series samples from the same volunteer as well as the samples from different individuals. In blood, sperm, and muscle cells, we resolved three common types of mutational signatures. Signatures A and B represent clock-like mutational processes, and the polymorphisms of epigenetic regulation genes influence the proportion of signature B in mutation profiles. Notably, signature C, characterized by C>T transitions at GpCpN sites, tends to be a feature of diverse normal tissues. Mutations of this type are likely to occur early during embryonic development, supported by their relatively high allelic frequencies, presence in multiple tissues, and decrease in occurrence with age. Almost none of the public datasets for tumors feature this signature, except for 19.6% of samples of clear cell renal cell carcinoma with increased activation of the hypoxia-inducible factor 1 (HIF-1) signaling pathway. Moreover, the accumulation of signature C in the mutation profile was accelerated in a human embryonic stem cell line with drug-induced activation of HIF-1α. Thus, embryonic hypoxia may explain this novel signature across multiple normal tissues. Our study suggests that hypoxic condition in an early stage of embryonic development is a crucial factor inducing C>T transitions at GpCpN sites; and individuals’ genetic background may also influence their postzygotic mutation profiles.
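Signature C is defined by C>T transitions at GpCpN sites, meaning the mutated cytosine has a guanine immediately 5′ of it and any base 3′ of it. The sketch below tests a single substitution against that definition, given its reference trinucleotide context. The function names are illustrative, and folding purine-strand calls onto the pyrimidine strand is a common convention in signature analysis that is assumed here rather than taken from the paper.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_c_to_t_at_gpcpn(context: str, alt: str) -> bool:
    """True if a substitution is a C>T transition at a GpCpN site.

    context: reference trinucleotide, 5'->3', with the mutated base in the middle
    alt:     the alternate base observed at the middle position
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1:
        raise ValueError("context must have 3 bases and alt must have 1")
    # Report every mutation from the pyrimidine strand (reference C or T).
    if context[1] in "AG":
        context, alt = reverse_complement(context), reverse_complement(alt)
    return context[0] == "G" and context[1] == "C" and alt == "T"


print(is_c_to_t_at_gpcpn("GCA", "T"))  # C>T with a 5' G
print(is_c_to_t_at_gpcpn("ACA", "T"))  # C>T but with a 5' A
print(is_c_to_t_at_gpcpn("TGC", "A"))  # G>A, the same event seen on the other strand
```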

Page 177-191

Research Article

Robust Benchmark Structural Variant Calls of An Asian Using State-of-the-art Long-read Sequencing Technologies

Xiao Du, Lili Li, Fan Liang, Sanyang Liu, Wenxin Zhang, Shuai Sun, Yuhui Sun, Fei Fan, Linying Wang, Xinming Liang, Weijin Qiu, Guangyi Fan, Ou Wang, Weifei Yang, Jiezhong Zhang, Yuhui Xiao, Yang Wang, Depeng Wang, Shoufang Qu, Fang Chen, Jie Huang

The importance of structural variants (SVs) for human phenotypes and diseases is now recognized. Although a variety of SV detection platforms and strategies that vary in sensitivity and specificity have been developed, few benchmarking procedures are available to confidently assess their performances in biological and clinical research. To facilitate the validation and application of these SV detection approaches, we established an Asian reference material by characterizing the genome of an Epstein-Barr virus (EBV)-immortalized B lymphocyte line along with identified benchmark regions and high-confidence SV calls. We established a high-confidence SV callset with 8938 SVs by integrating four alignment-based SV callers, including 109× Pacific Biosciences (PacBio) continuous long reads (CLRs), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore Technologies (ONT) long reads, and 114× Bionano optical mapping platform, and one de novo assembly-based SV caller using CCS reads. A total of 544 randomly selected SVs were validated by PCR amplification and Sanger sequencing, demonstrating the robustness of our SV calls. Combining trio-binning-based haplotype assemblies, we established an SV benchmark for identifying false negatives and false positives by constructing the continuous high-confidence regions (CHCRs), which covered 1.46 gigabase pairs (Gb) and 6882 SVs supported by at least one diploid haplotype assembly. Establishing high-confidence SV calls for a benchmark sample that has been characterized by multiple technologies provides a valuable resource for investigating SVs in human biology, disease, and clinical research.
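Structural variants (SVs) have an important influence on human phenotypes and diseases. Although many SV detection platforms and calling methods have been developed for research and clinical use, they differ in sensitivity and specificity, and few SV reference materials are available for assessing with high confidence how these platforms and strategies perform in biological and clinical research. To facilitate the validation and application of such SV detection methods, we established an EBV-immortalized B lymphocyte line, sequenced its genome at high depth using several state-of-the-art long-read sequencing technologies, and constructed an Asian haplotype-resolved reference material. This reference material precisely describes the high-confidence SVs within the benchmark regions of its genome and can be used to evaluate different SV calling software. We used four alignment-based SV calling strategies, based on 109× PacBio continuous long reads (CLR), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore (ONT) reads, and 114× Bionano data, together with one strategy based on de novo assembly of PacBio CCS reads, to establish a high-confidence callset of 8938 SVs. We also validated 544 randomly selected SVs by PCR and Sanger sequencing, demonstrating the accuracy and reliability of this callset. Using a trio-binning strategy, we built haplotype assemblies for the reference material and, on this basis, constructed an SV set within continuous high-confidence regions (CHCRs) for identifying false positives and false negatives in SV detection. The CHCRs cover 1.46 Gb of the genome and contain 6882 high-confidence SVs, each supported by at least one haplotype assembly. A reference material with a high-confidence SV callset provides a valuable resource for studying structural variants in biology, disease, and clinical applications.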
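A benchmark callset of this kind is used to count true positives, false positives, and false negatives for an SV caller under evaluation. The sketch below shows that bookkeeping with a deliberately simple matching rule: same chromosome, same SV type, and breakpoints within a fixed distance. The matching rule, the 500 bp default, and all coordinates are assumptions made for illustration; they are not the criteria or data of the study.

```python
def benchmark_sv_calls(calls, truth, tolerance=500):
    """Compare SV calls with a truth set and return summary statistics.

    calls, truth: lists of (chromosome, position, sv_type) tuples
    tolerance:    maximum breakpoint distance in bp for a match (assumed value)
    """
    unmatched_truth = list(truth)
    true_pos = 0
    for chrom, pos, sv_type in calls:
        # Greedily pair each call with the first compatible unmatched truth SV.
        for i, (t_chrom, t_pos, t_type) in enumerate(unmatched_truth):
            if t_chrom == chrom and t_type == sv_type and abs(t_pos - pos) <= tolerance:
                true_pos += 1
                del unmatched_truth[i]
                break
    false_pos = len(calls) - true_pos
    false_neg = len(unmatched_truth)
    precision = true_pos / len(calls) if calls else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {"tp": true_pos, "fp": false_pos, "fn": false_neg,
            "precision": precision, "recall": recall}


# Hypothetical coordinates, for illustration only.
truth_set = [("chr1", 1000, "DEL"), ("chr1", 5000, "INS"), ("chr2", 300, "DEL")]
call_set = [("chr1", 1100, "DEL"), ("chr1", 9000, "INS")]
print(benchmark_sv_calls(call_set, truth_set))
```

Restricting both lists to the high-confidence regions before comparison keeps calls in unresolved parts of the genome from being counted as false positives.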

Page 192-204


Mako: A Graph-based Pattern Growth Approach to Detect Complex Structural Variants

Jiadong Lin, Xiaofei Yang, Walter Kosters, Tun Xu, Yanyan Jia, Songbo Wang, Qihui Zhu, Mallory Ryan, Li Guo, Chengsheng Zhang, The Human Genome Structural Variation Consortium, Charles Lee, Scott E. Devine, Evan E. Eichler, Kai Ye

Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV discovery compared with simple structural variants. Here, we systematically analyzed the multi-breakpoint connection feature of CSVs, and proposed Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Specifically, we implemented a graph-based pattern growth approach, where the graph depicts potential breakpoint connections, and pattern growth enables CSV detection without pre-defined models. Comprehensive evaluations on both simulated and real datasets revealed that Mako outperformed other algorithms. Notably, validation rates of CSVs on real data based on experimental and computational validations as well as manual inspections are around 70%, where the medians of experimental and computational breakpoint shift are 13 bp and 26 bp, respectively. Moreover, the Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types of adjacent segment swap and tandem dispersed duplication. Further analysis of these CSVs also revealed the impact of sequence homology on the formation of CSVs. Mako is publicly available at https://github.com/xjtu-omics/Mako.
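The Mako abstract describes a graph whose edges depict potential breakpoint connections, with complex SVs defined as events that have more than two breakpoints. The sketch below illustrates only that graph idea, using a much simpler stand-in than Mako's pattern growth: it groups connected breakpoints and keeps groups with more than two members as candidate complex events. It is not Mako's algorithm, and the breakpoint coordinates are invented.

```python
from collections import defaultdict


def candidate_csv_clusters(edges):
    """Group breakpoints into connected components and keep those with > 2 nodes.

    edges: iterable of (breakpoint_a, breakpoint_b) pairs, e.g. positions linked
           by discordant read pairs (hypothetical input format).
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search to collect one connected component.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        if len(component) > 2:  # more than two breakpoints: candidate complex SV
            clusters.append(sorted(component))
    return clusters


# Four linked breakpoints form one candidate; the isolated pair is a simple SV.
links = [(100, 200), (200, 350), (350, 900), (5000, 5200)]
print(candidate_csv_clusters(links))
```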

Page 205-218