1. En Route to Completion: What Is An Ideal Reference Genome?
Weihua Pan, Jue Ruan
2. High-quality Arabidopsis thaliana Genome Assembly with Nanopore and HiFi Long Reads
Bo Wang, Xiaofei Yang, Yanyan Jia, Yu Xu, Peng Jia, Ningxin Dang, Songbo Wang, Tun Xu, Xixi Zhao, Shenghan Gao, Quanbin Dong, Kai Ye
Arabidopsis thaliana is an important and long-established model species for plant molecular biology, genetics, epigenetics, and genomics. However, the latest version of the reference genome still contains a significant number of missing segments. Here, we report a high-quality and almost complete Col-0 genome assembly with two gaps (named Col-XJTU) by combining the Oxford Nanopore Technologies ultra-long reads, Pacific Biosciences high-fidelity long reads, and Hi-C data. The total genome assembly size is 133,725,193 bp, introducing 14.6 Mb of novel sequences compared to the TAIR10.1 reference genome. All five chromosomes of the Col-XJTU assembly are highly accurate with consensus quality (QV) scores > 60 (ranging from 62 to 68), which are higher than those of the TAIR10.1 reference (ranging from 45 to 52). We completely resolved chromosome (Chr) 3 and Chr5 in a telomere-to-telomere manner. Chr4 was completely resolved except the nucleolar organizing regions, which comprise long repetitive DNA fragments. The Chr1 centromere (CEN1), reportedly around 9 Mb in length, is particularly challenging to assemble due to the presence of tens of thousands of CEN180 satellite repeats. Using the cutting-edge sequencing data and novel computational approaches, we assembled a 3.8-Mb-long CEN1 and a 3.5-Mb-long CEN2. We also investigated the structure and epigenetics of centromeres. Four clusters of CEN180 monomers were detected, and the centromere-specific histone H3-like protein (CENH3) exhibited a strong preference for CEN180 Cluster 3. Moreover, we observed hypomethylation patterns in CENH3-enriched regions. We believe that this high-quality genome assembly, Col-XJTU, would serve as a valuable reference to better understand the global pattern of centromeric polymorphisms, as well as the genetic and epigenetic features in plants.
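To address the difficulty of assembling the ultra-long repetitive sequences in centromeric regions, a hybrid assembly-and-replacement strategy anchored by bacterial artificial chromosome (BAC) sequences was proposed. High-depth, high-accuracy HiFi (157×) and ONT ultra-long (177×) sequencing were combined with Hi-C chromosome conformation capture, and several assembly tools (hifiasm, NextDenovo, 3D-DNA, and Juicer) were used together to complete a high-quality Arabidopsis genome assembly with only two remaining gaps. The sequence accuracy and structural correctness of Col-XJTU were evaluated with Merqury, bacValidation, TandemTools, and BUSCO. LASTZ, Clustal Omega, and the R package 'phyclust' were used for systematic clustering and variation analysis of the highly tandemly repeated centromeric 5S rDNA and CEN180 sequences. MACS2 and Nanopolish were used, respectively, to analyze centromere-specific histone CENH3 enrichment and DNA methylation at the centromeres.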
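The number of 5S rDNA monomers annotated in the centromeric regions is twice that previously reported, and the monomers fall into four clusters. All four 5S rDNA clusters show high GC content and high DNA methylation.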
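The consensus quality (QV) scores quoted in the abstract for this entry are on the Phred scale, where QV = -10 · log10(per-base error rate). The short sketch below only illustrates that conversion; the function names are hypothetical, and it does not reproduce how the scores were actually estimated (the summary mentions Merqury for that step).

```python
import math


def phred_qv(error_rate: float) -> float:
    """Convert a per-base error rate into a Phred-scaled quality value."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return -10.0 * math.log10(error_rate)


def error_rate_from_qv(qv: float) -> float:
    """Invert the Phred scale: per-base error rate implied by a QV."""
    return 10.0 ** (-qv / 10.0)


def expected_errors(qv: float, length_bp: int) -> float:
    """Expected number of erroneous bases in a sequence of the given length."""
    return error_rate_from_qv(qv) * length_bp


if __name__ == "__main__":
    # Illustrative values: QV bounds quoted in the abstract and the
    # reported assembly size in bp.
    for qv in (45, 52, 62, 68):
        print(qv, error_rate_from_qv(qv), expected_errors(qv, 133_725_193))
```

Read this way, a QV of 60 corresponds to roughly one erroneous base per million, so each 10-point gain in QV means a tenfold lower error rate.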
3. Genome Assembly of Alfalfa Cultivar Zhongmu-4 and Identification of SNPs Associated with Agronomic Traits
Ruicai Long, Fan Zhang, Zhiwu Zhang, Mingna Li, Lin Chen, Xue Wang, Wenwen Liu, Tiejun Zhang, Long-Xi Yu, Fei He, Xueqian Jiang, Xijiang Yang, Changfu Yang, Zhen Wang, Junmei Kang, Qingchuan Yang
Alfalfa (Medicago sativa L.) is the most important legume forage crop worldwide with high nutritional value and yield. For a long time, the breeding of alfalfa was hampered by the lack of reliable information on the autotetraploid genome and molecular markers linked to important agronomic traits. We herein report the de novo assembly of the allele-aware chromosome-level genome of Zhongmu-4, a cultivar widely cultivated in China, and a comprehensive database of genomic variations based on resequencing of 220 germplasms. Approximately 2.74 Gb of contigs (N50 of 2.06 Mb), accounting for 88.39% of the estimated genome, were assembled, and 2.56 Gb of contigs were anchored to 32 pseudo-chromosomes. A total of 34,922 allelic genes were identified from the allele-aware genome. We observed the expansion of gene families, especially those related to nitrogen metabolism, and the increase of repetitive elements including transposable elements, which probably resulted in the larger genome size of Zhongmu-4 compared with that of Medicago truncatula. Population structure analysis revealed that the accessions from Asia and South America had relatively lower genetic diversity than those from Europe, suggesting that geography may influence alfalfa genetic divergence during local adaptation. Genome-wide association studies identified 101 single nucleotide polymorphisms (SNPs) associated with 27 agronomic traits. Two candidate genes were predicted to be correlated with fall dormancy and salt response. We believe that the allele-aware chromosome-level genome sequence of Zhongmu-4 combined with the resequencing data of the diverse alfalfa germplasms will facilitate genetic research and genomics-assisted breeding in variety improvement of alfalfa.
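Assembly of the sequencing data yielded 2.74 Gb of contig sequence (N50: 2.06 Mb; BUSCO: 98.4%; LAI: 13.85). Hi-C data were used to anchor 2.56 Gb of this sequence to 32 chromosomes. Genome annotation identified 146,704 protein-coding genes, and a total of 34,922 allelic genes were identified among the homologous chromosomes.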
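The contig N50 reported for this assembly is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembled length. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative assumptions, not data from the study.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point issues with total / 2.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: running total always reaches the total")


if __name__ == "__main__":
    toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical lengths
    print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, it describes contiguity and says nothing about correctness, which is why the summary reports it alongside BUSCO and LAI.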
4. Resequencing 250 Soybean Accessions: New Insights into Genes Associated with Agronomic Traits and Genetic Networks
Chunming Yang, Jun Yan, Shuqin Jiang, Xia Li, Haowei Min, Xiangfeng Wang, Dongyun Hao
The limited knowledge of genomic diversity and functional genes associated with the traits of soybean varieties has resulted in slow progress in breeding. In this study, we sequenced the genomes of 250 soybean landraces and cultivars from China, America, and Europe, and investigated their population structure, genetic diversity and architecture, and the selective sweep regions of these accessions. Five novel agronomically important genes were identified, and the effects of functional mutations in respective genes were examined. The candidate genes GSTT1, GL3, and GSTL3 associated with the isoflavone content, CKX3 associated with yield traits, and CYP85A2 associated with both architecture and yield traits were found. The phenotype–gene network analysis revealed that hub nodes play a crucial role in complex phenotypic associations. This study describes novel agronomic trait-associated genes and a complex genetic network, providing a valuable resource for future soybean molecular breeding.
5. A Chromosome-level Genome Assembly of Wild Castor Provides New Insights into its Adaptive Evolution in Tropical Desert
Jianjun Lu, Cheng Pan, Wei Fan, Wanfei Liu, Huayan Zhao, Donghai Li, Sen Wang, Lianlian Hu, Bing He, Kun Qian, Rui Qin, Jue Ruan, Qiang Lin, Shiyou Lü, Peng Cui
Wild castor grows in the high-altitude tropical desert of the African Plateau, a region known for high ultraviolet radiation, strong light, and extremely dry conditions. To investigate the potential genetic basis of adaptation to both highland and tropical deserts, we generated a chromosome-level genome sequence assembly of the wild castor accession WT05, with a genome size of 316 Mb, a scaffold N50 of 31.93 Mb, and a contig N50 of 8.96 Mb. Compared with cultivated castor and other Euphorbiaceae species, the wild castor exhibits positive selection and gene family expansion for genes involved in DNA repair, photosynthesis, and abiotic stress responses. Genetic variations associated with positive selection were identified in several key genes, such as LIG1, DDB2, and RECG1, involved in nucleotide excision repair. Moreover, a study of genomic diversity among wild and cultivated accessions revealed genomic regions containing selection signatures associated with the adaptation to extreme environments. The identification of the genes and alleles with selection signatures provides insights into the genetic mechanisms underlying the adaptation of wild castor to the high-altitude tropical desert and would facilitate direct improvement of modern castor varieties.
6. Genomic Perspectives on the Emerging SARS-CoV-2 Omicron Variant
Wentai Ma, Jing Yang, Haoyi Fu, Chao Su, Caixia Yu, Qihui Wang, Ana Tereza Ribeiro de Vasconcelos, Georgii A. Bazykin, Yiming Bao, Mingkun Li
A new variant of concern for SARS-CoV-2, Omicron (B.1.1.529), was designated by the World Health Organization on November 26, 2021. This study analyzed the viral genome sequencing data of 108 samples collected from patients infected with Omicron. First, we found that the enrichment efficiency of viral nucleic acids was reduced due to mutations in the regions where the primers anneal. Second, the Omicron variant possesses an excessive number of mutations compared to other variants circulating at the same time (median: 62 vs. 45), especially in the Spike gene. Mutations in the Spike gene confer alterations in 32 amino acid residues, more than those observed in other SARS-CoV-2 variants. Moreover, a large number of nonsynonymous mutations occur in the codons for the amino acid residues located on the surface of the Spike protein, which could potentially affect the replication, infectivity, and antigenicity of SARS-CoV-2. Third, there are 53 mutations between the Omicron variant and its closest sequences available in public databases. Many of these mutations were rarely observed in public databases and had a low mutation rate. In addition, the linkage disequilibrium between these mutations was low, with a limited number of mutations concurrently observed in the same genome, suggesting that the Omicron variant would be in a different evolutionary branch from the currently prevalent variants. To improve our ability to detect and track the source of new variants rapidly, it is imperative to further strengthen genomic surveillance and data sharing globally in a timely manner.
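The Omicron variant is a recently discovered variant of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the fifth variant designated a "variant of concern" (VOC) by the World Health Organization. Since it was first detected in Africa in November 2021, it has attracted wide attention because of the large number of mutations in its Spike protein and its rapid spread. What characterizes the mutations in the Omicron genome? Where did the variant come from? Will it affect the ability of the virus to escape vaccines or antibodies? These are the questions of general concern.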
7. Single-cell Transcriptomic Analysis Reveals the Cellular Heterogeneity of Mesenchymal Stem Cells
Chen Zhang, Xueshuai Han, Jingkun Liu, Lei Chen, Ying Lei, Kunying Chen, Jia Si, Tian-yi Wang, Hui Zhou, Xiaoyun Zhao, Xiaohui Zhang, Yihua An, Yueying Li, Qian-Fei Wang
Ex vivo-expanded mesenchymal stem cells (MSCs) have been demonstrated to be a heterogeneous mixture of cells exhibiting varying proliferative, multipotential, and immunomodulatory capacities. However, the exact characteristics of MSCs remain largely unknown. By single-cell RNA sequencing of 61,296 MSCs derived from bone marrow and Wharton’s jelly, we revealed five distinct subpopulations. The developmental trajectory of these five MSC subpopulations was mapped, revealing a differentiation path from stem-like active proliferative cells (APCs) to multipotent progenitor cells, followed by branching into two paths: 1) unipotent preadipocytes or 2) bipotent prechondro-osteoblasts that were subsequently differentiated into unipotent prechondrocytes. The stem-like APCs, expressing the perivascular mesodermal progenitor markers CSPG4/MCAM/NES, uniquely exhibited strong proliferation and stemness signatures. Remarkably, the prechondrocyte subpopulation specifically expressed immunomodulatory genes and was able to suppress activated CD3+ T cell proliferation in vitro, supporting the role of this population in immunoregulation. In summary, our analysis mapped the heterogeneous subpopulations of MSCs and identified two subpopulations with potential functions in self-renewal and immunoregulation. Our findings advance the definition of MSCs by identifying the specific functions of their heterogeneous cellular composition, allowing for more specific and effective MSC application through the purification of their functional subpopulations.
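Ex vivo-expanded mesenchymal stem cells (MSCs) are an important class of multipotent stem cells with multiple functions, including self-renewal, multilineage differentiation, and secretion of inflammatory factors, and they are among the most widely used stem cell products in the clinic. However, MSCs are highly heterogeneous, which limits their application and therapeutic efficacy. Previous studies have not accurately identified the heterogeneous subpopulations of ex vivo MSCs or their characteristics, a question of broad interest in the field.
To study the cellular heterogeneity of ex vivo MSCs, we collected a total of 61,296 MSCs from two sources: adult bone marrow (bone marrow-derived MSCs, BMMSCs) and neonatal umbilical cord Wharton's jelly (Wharton's jelly-derived MSCs, WJMSCs). Using single-cell transcriptomics, we comprehensively characterized the cellular composition and functional subpopulations of MSCs from the two sources, and combined this with functional experiments to define the specific markers and distinct functions of the functional subpopulations.
3. The multipotent mesenchymal progenitor cell (MPC) subpopulation (subpopulation 2) has the potential to differentiate along all three lineages: osteogenic, adipogenic, and chondrogenic;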
8. Defining Proximity Proteome of Histone Modifications by Antibody-mediated Protein A-APEX2 Labeling
Xinran Li, Jiaqi Zhou, Wenjuan Zhao, Qing Wen, Weijie Wang, Huipai Peng, Yuan Gao, Kelly J. Bouchonville, Steven M. Offer, Kuiming Chan, Zhiquan Wang, Nan Li, Haiyun Gan
Proximity labeling catalyzed by promiscuous enzymes, such as APEX2, has emerged as a powerful approach to characterize multiprotein complexes and protein–protein interactions. However, current methods depend on the expression of exogenous fusion proteins and cannot be applied to identify proteins surrounding post-translationally modified proteins. To address this limitation, we developed a new method to label proximal proteins of interest by antibody-mediated protein A-ascorbate peroxidase 2 (pA-APEX2) labeling (AMAPEX). In this method, a modified protein is bound in situ by a specific antibody, which then tethers a pA-APEX2 fusion protein. Activation of APEX2 labels the nearby proteins with biotin; the biotinylated proteins are then purified using streptavidin beads and identified by mass spectrometry. We demonstrated the utility of this approach by profiling the proximal proteins of histone modifications including H3K27me3, H3K9me3, H3K4me3, H4K5ac, and H4K12ac, as well as verifying the co-localization of these identified proteins with bait proteins by published ChIP-seq analysis and nucleosome immunoprecipitation. Overall, AMAPEX is an efficient method to identify proteins that are proximal to modified histones.
9. Epithelial Cells in 2D and 3D Cultures Exhibit Large Differences in Higher-order Genomic Interactions
Xin Liu, Qiu Sun, Qi Wang, Chuansheng Hu, Xuecheng Chen, Hua Li, Daniel M. Czajkowsky, Zhifeng Shao
Recent studies have characterized the genomic structures of many eukaryotic cells, often focusing on their relation to gene expression. However, these studies have largely investigated cells grown in 2D cultures, although the transcriptomes of 3D-cultured cells are generally closer to their in vivo phenotypes. To examine the effects of spatial constraints on chromosome conformation, we investigated the genomic architecture of mouse hepatocytes grown in 2D and 3D cultures using in situ Hi-C. Our results reveal significant differences in higher-order genomic interactions, notably in compartment identity and strength as well as in topologically associating domain (TAD)–TAD interactions, but only minor differences are found at the TAD level. Our RNA-seq analysis reveals an up-regulated expression of genes involved in physiological hepatocyte functions in the 3D-cultured cells. These genes are associated with a subset of structural changes, suggesting that differences in genomic structure are critically important for transcriptional regulation. However, there are also many structural differences that are not directly associated with changes in gene expression, whose cause remains to be determined. Overall, our results indicate that growth in 3D significantly alters higher-order genomic interactions, which may be consequential for a subset of genes that are important for the physiological functioning of the cell.
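In recent years, much work has sought to reveal the spatial organization of the genome in eukaryotic cells, particularly the relationship between three-dimensional chromatin structure and gene expression. To date, although the transcriptomes of 3D-cultured cells are closer to their in situ cellular phenotypes, most studies of chromatin structure have still used cells grown in 2D culture. Therefore, to determine how growth conditions affect chromatin structure, we applied in situ Hi-C to study the spatial chromatin structure of mouse hepatocytes under 2D and 3D culture conditions. The results show that culture conditions significantly affect higher-order chromatin structure. These effects appear mainly at higher-order scales, such as the identity of chromatin compartments and the strength of interactions between topologically associating domains (TADs), whereas the TADs themselves are not significantly affected. Transcriptome analysis shows that, in 3D-cultured cells, genes related to physiological hepatocyte functions are markedly up-regulated, and that this up-regulation is associated with the properties of the chromatin compartments in which the genes reside. This result further confirms the important role of chromatin structure in transcriptional regulation. However, we also found that many significant changes in chromatin structure are not directly associated with changes in gene expression, and the underlying mechanism remains to be investigated. In summary, these results clearly demonstrate the direct effect of 3D culture conditions on chromatin structure within cells, and these structures may be critically important for maintaining the physiological functions of the cell.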
10. Npac Is A Co-factor of Histone H3K36me3 and Regulates Transcriptional Elongation in Mouse Embryonic Stem Cells
Chromatin modification contributes to pluripotency maintenance in embryonic stem cells (ESCs). However, the related mechanisms remain obscure. Here, we show that Npac, a “reader” of histone H3 lysine 36 trimethylation (H3K36me3), is required to maintain mouse ESC (mESC) pluripotency since knockdown of Npac causes mESC differentiation. Depletion of Npac in mouse embryonic fibroblasts (MEFs) inhibits reprogramming efficiency. Furthermore, our chromatin immunoprecipitation followed by sequencing (ChIP-seq) results of Npac reveal that Npac co-localizes with histone H3K36me3 in gene bodies of actively transcribed genes in mESCs. Interestingly, we find that Npac interacts with positive transcription elongation factor b (p-TEFb), Ser2-phosphorylated RNA Pol II (RNA Pol II Ser2P), and Ser5-phosphorylated RNA Pol II (RNA Pol II Ser5P). Furthermore, depletion of Npac disrupts transcriptional elongation of the pluripotency genes Nanog and Rif1. Taken together, we propose that Npac is essential for the transcriptional elongation of pluripotency genes by recruiting p-TEFb and interacting with RNA Pol II Ser2P and Ser5P.
11. SLM2 Is A Novel Cardiac Splicing Factor Involved in Heart Failure due to Dilated Cardiomyopathy
Jes-Niels Boeckel, Maximilian Möbius-Winkler, Marion Müller, Sabine Rebs, Nicole Eger, Laura Schoppe, Rewati Tappu, Karoline E. Kokot, Jasmin M. Kneuer, Susanne Gaul, Diana M. Bordalo, Alan Lai, Jan Haas, Mahsa Ghanbari, Philipp Drewe-Boss, Martin Liss, Hugo A. Katus, Uwe Ohler, Michael Gotthardt, Ulrich Laufs, Katrin Streckfuss-Bömeke, Benjamin Meder
Alternative mRNA splicing is a fundamental process to increase the versatility of the genome. In humans, cardiac mRNA splicing is involved in the pathophysiology of heart failure. Mutations in the splicing factor RNA binding motif protein 20 (RBM20) cause severe forms of cardiomyopathy. To identify novel cardiomyopathy-associated splicing factors, RNA-seq and tissue-enrichment analyses were performed, which identified up-regulated expression of Sam68-Like mammalian protein 2 (SLM2) in the left ventricle of dilated cardiomyopathy (DCM) patients. In the human heart, SLM2 binds to important transcripts of sarcomere constituents, such as those encoding myosin light chain 2 (MYL2), troponin I3 (TNNI3), troponin T2 (TNNT2), tropomyosin 1/2 (TPM1/2), and titin (TTN). Mechanistically, SLM2 mediates intron retention, prevents exon exclusion, and thereby mediates alternative splicing of the mRNA regions encoding the variable proline-, glutamate-, valine-, and lysine-rich (PEVK) domain and another part of the I-band region of titin. In summary, SLM2 is a novel cardiac splicing regulator with essential functions for maintaining cardiomyocyte integrity by binding to and processing the mRNAs of essential cardiac constituents such as titin.
12. Convergent Usage of Amino Acids in Human Cancers as A Reversed Process of Tissue Development
Yikai Luo, Han Liang
Genome- and transcriptome-wide amino acid usage preference across different species is a well-studied phenomenon in molecular evolution, but its characteristics and implication in cancer evolution and therapy remain largely unexplored. Here, we analyzed large-scale transcriptome/proteome profiles, such as The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx), and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and found that compared to normal tissues, different cancer types showed a convergent pattern toward using biosynthetically low-cost amino acids. Such a pattern can be accurately captured by a single index based on the average biosynthetic energy cost of amino acids, termed energy cost per amino acid (ECPA). With this index, we further compared the trends of amino acid usage and the contributing genes in cancer and tissue development, and revealed their reversed patterns. Finally, focusing on the liver, a tissue with a dramatic increase in ECPA during development, we found that ECPA represents a powerful biomarker that could distinguish liver tumors from normal liver samples consistently across 11 independent patient cohorts and outperforms any index based on single genes. Our study reveals an important principle underlying cancer evolution and suggests the global amino acid usage as a system-level biomarker for cancer diagnosis.
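In this study, the authors used large-scale transcriptome and proteome data from cancers and from normal tissue development, such as The Cancer Genome Atlas (TCGA), the Genotype-Tissue Expression (GTEx) project, and the Clinical Proteomic Tumor Analysis Consortium (CPTAC), to estimate and compare how tissue cells select and use the twenty amino acids during these two processes. The authors relied mainly on a single-value index called ECPA (energy cost per amino acid), which combines two layers of information, protein sequences and the expression levels of the corresponding genes, to systematically summarize the amino acid usage reflected in a sample's gene expression profile. Based on the strongly reversed amino acid usage patterns observed between carcinogenesis and tissue development in the liver, the authors analyzed multiple independent gene expression datasets and examined in detail the ability of the ECPA index to distinguish liver tumor tissue from paired adjacent normal tissue.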
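The ECPA index described for this entry combines protein sequences with gene expression levels into one average biosynthetic cost per amino acid. One plausible reading, sketched below, is an expression-weighted mean of per-residue costs. The exact formula is an assumption here, and the cost table holds made-up placeholder values for a three-letter toy alphabet, not the biosynthetic costs used in the study.

```python
from collections import Counter
from typing import Mapping

# Placeholder costs for a toy alphabet (hypothetical values, for illustration only).
PLACEHOLDER_COST: Mapping[str, float] = {"G": 1.0, "A": 2.0, "W": 10.0}


def ecpa(
    proteins: Mapping[str, str],
    expression: Mapping[str, float],
    cost: Mapping[str, float] = PLACEHOLDER_COST,
) -> float:
    """Expression-weighted average cost per amino acid for one sample.

    Assumed form: sum_g e_g * sum_aa n_{g,aa} * c_aa / sum_g e_g * sum_aa n_{g,aa},
    where e_g is the expression of gene g and n_{g,aa} counts residue aa in g.
    """
    total_cost = 0.0
    total_residues = 0.0
    for gene, sequence in proteins.items():
        weight = expression.get(gene, 0.0)
        if weight <= 0.0:
            continue  # unexpressed genes contribute nothing
        for residue, count in Counter(sequence).items():
            if residue not in cost:
                continue  # skip residues without a cost entry
            total_cost += weight * count * cost[residue]
            total_residues += weight * count
    if total_residues == 0.0:
        raise ValueError("no expressed residues with a known cost")
    return total_cost / total_residues


if __name__ == "__main__":
    toy_proteins = {"gene1": "GGAA", "gene2": "WW"}  # hypothetical sequences
    print(ecpa(toy_proteins, {"gene1": 1.0, "gene2": 1.0}))
    print(ecpa(toy_proteins, {"gene1": 1.0, "gene2": 0.0}))
```

Under this reading, shifting expression away from the gene rich in the costly residue lowers the index, which is the direction of change the abstract reports for tumors relative to normal tissue.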
13. Integrative Proteomic Analysis of Multiple Posttranslational Modifications in Inflammatory Response
Feiyang Ji, Menghao Zhou, Huihui Zhu, Zhengyi Jiang, Qirui Li, Xiaoxi Ouyang, Yiming Lv, Sainan Zhang, Tian Wu, Lanjuan Li
Posttranslational modifications (PTMs) of proteins, particularly acetylation, phosphorylation, and ubiquitination, play critical roles in the host innate immune response. PTMs’ dynamic changes and the crosstalk among them are complicated. To build a comprehensive dynamic network of inflammation-related proteins, we integrated data from the whole-cell proteome (WCP), acetylome, phosphoproteome, and ubiquitinome of human and mouse macrophages. Our datasets of acetylation, phosphorylation, and ubiquitination sites helped identify PTM crosstalk within and across proteins involved in the inflammatory response. Stimulation of macrophages by lipopolysaccharide (LPS) resulted in both degradative and non-degradative ubiquitination. Moreover, this study contributes to the interpretation of the roles of known inflammatory molecules and the discovery of novel inflammatory proteins.
14. Common Postzygotic Mutational Signatures in Healthy Adult Tissues Related to Embryonic Hypoxia
Yaqiang Hong, Dake Zhang, Xiangtian Zhou, Aili Chen, Amir Abliz, Jian Bai, Liang Wang, Qingtao Hu, Kenan Gong, Xiaonan Guan, Mengfei Liu, Xinchang Zheng, Shujuan Lai, Hongzhu Qu, Fuxin Zhao, Shuang Hao, Zhen Wu, Hong Cai, Shaoyan Hu, Yue Ma, Junting Zhang, Yang Ke, Qian-Fei Wang, Wei Chen, Changqing Zeng
Postzygotic mutations are acquired in normal tissues throughout an individual’s lifetime and hold clues for identifying mutagenic factors. Here, we investigated postzygotic mutation spectra of healthy individuals using optimized ultra-deep exome sequencing of the time-series samples from the same volunteer as well as the samples from different individuals. In blood, sperm, and muscle cells, we resolved three common types of mutational signatures. Signatures A and B represent clock-like mutational processes, and the polymorphisms of epigenetic regulation genes influence the proportion of signature B in mutation profiles. Notably, signature C, characterized by C>T transitions at GpCpN sites, tends to be a feature of diverse normal tissues. Mutations of this type are likely to occur early during embryonic development, supported by their relatively high allelic frequencies, presence in multiple tissues, and decrease in occurrence with age. Almost none of the public datasets for tumors feature this signature, except for 19.6% of samples of clear cell renal cell carcinoma with increased activation of the hypoxia-inducible factor 1 (HIF-1) signaling pathway. Moreover, the accumulation of signature C in the mutation profile was accelerated in a human embryonic stem cell line with drug-induced activation of HIF-1α. Thus, embryonic hypoxia may explain this novel signature across multiple normal tissues. Our study suggests that hypoxic condition in an early stage of embryonic development is a crucial factor inducing C>T transitions at GpCpN sites; and individuals’ genetic background may also influence their postzygotic mutation profiles.
15. Robust Benchmark Structural Variant Calls of An Asian Using State-of-the-art Long-read Sequencing Technologies
Xiao Du, Lili Li, Fan Liang, Sanyang Liu, Wenxin Zhang, Shuai Sun, Yuhui Sun, Fei Fan, Linying Wang, Xinming Liang, Weijin Qiu, Guangyi Fan, Ou Wang, Weifei Yang, Jiezhong Zhang, Yuhui Xiao, Yang Wang, Depeng Wang, Shoufang Qu, Fang Chen, Jie Huang
The importance of structural variants (SVs) for human phenotypes and diseases is now recognized. Although a variety of SV detection platforms and strategies that vary in sensitivity and specificity have been developed, few benchmarking procedures are available to confidently assess their performances in biological and clinical research. To facilitate the validation and application of these SV detection approaches, we established an Asian reference material by characterizing the genome of an Epstein-Barr virus (EBV)-immortalized B lymphocyte line along with identified benchmark regions and high-confidence SV calls. We established a high-confidence SV callset with 8938 SVs by integrating four alignment-based SV callers, including 109× Pacific Biosciences (PacBio) continuous long reads (CLRs), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore Technologies (ONT) long reads, and 114× Bionano optical mapping platform, and one de novo assembly-based SV caller using CCS reads. A total of 544 randomly selected SVs were validated by PCR amplification and Sanger sequencing, demonstrating the robustness of our SV calls. Combining trio-binning-based haplotype assemblies, we established an SV benchmark for identifying false negatives and false positives by constructing the continuous high-confidence regions (CHCRs), which covered 1.46 gigabase pairs (Gb) and 6882 SVs supported by at least one diploid haplotype assembly. Establishing high-confidence SV calls for a benchmark sample that has been characterized by multiple technologies provides a valuable resource for investigating SVs in human biology, disease, and clinical research.
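Structural variants (SVs) have an important impact on human phenotypes and diseases. Although a variety of SV detection platforms and identification methods have been developed and used in research and clinical applications, these platforms and strategies typically differ in sensitivity and specificity, and few SV benchmark materials are available for assessing their performance in biological and clinical research with high confidence. To facilitate the validation and application of such SV detection methods, we established an EBV-immortalized B lymphocyte line, sequenced its genome at high depth using several of the latest long-read sequencing technologies, and constructed an Asian-specific haplotype-resolved reference material. This Asian reference material accurately describes the high-confidence SVs within its genomic benchmark regions and can be used to evaluate the different SV-calling software available. We used four alignment-based SV identification strategies, based on 109× PacBio continuous long reads (CLR), 22× PacBio circular consensus sequencing (CCS) reads, 104× Oxford Nanopore (ONT) reads, and 114× Bionano data, together with one SV identification strategy based on de novo assembly of PacBio CCS reads, to establish a high-confidence SV set containing 8938 SVs. We also validated 544 randomly selected SVs by PCR and Sanger sequencing, demonstrating the high accuracy and reliability of this SV set. Using a trio-binning strategy, we built haplotype-resolved genome assemblies of this reference material and, on that basis, constructed an SV set within continuous high-confidence regions (CHCRs) for identifying false positives and false negatives in SV detection. The CHCRs cover 1.46 Gb of the genome and contain 6882 high-confidence SVs, each supported by at least one haplotype assembly. The establishment of a benchmark material with a high-confidence SV set will provide a very valuable resource for studying SVs in biology, disease, and clinical applications.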
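A benchmark callset of this kind is typically used to score another caller's output by counting true positives, false positives, and false negatives. The sketch below is a deliberately simplified illustration of that idea, not the evaluation procedure used in the study. It matches calls to benchmark SVs greedily by chromosome, SV type, and start position within a tolerance; the 500 bp default, the tuple layout, and the toy records are all assumptions, and a real comparison would also restrict both sets to the benchmark regions and consider SV size and sequence.

```python
from typing import Dict, List, Tuple

SV = Tuple[str, int, str]  # (chromosome, start position, SV type): assumed layout


def benchmark(truth: List[SV], calls: List[SV], tolerance: int = 500) -> Dict[str, float]:
    """Greedy one-to-one matching of calls against a truth set."""
    unmatched = list(truth)
    true_pos = 0
    for chrom, pos, sv_type in calls:
        best_index = None
        best_distance = tolerance + 1
        for i, (t_chrom, t_pos, t_type) in enumerate(unmatched):
            distance = abs(t_pos - pos)
            if t_chrom == chrom and t_type == sv_type and distance < best_distance:
                best_index, best_distance = i, distance
        if best_index is not None:
            unmatched.pop(best_index)  # each truth SV can be matched only once
            true_pos += 1
    false_pos = len(calls) - true_pos
    false_neg = len(unmatched)
    precision = true_pos / len(calls) if calls else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return {
        "tp": true_pos,
        "fp": false_pos,
        "fn": false_neg,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    toy_truth = [("chr1", 1000, "DEL"), ("chr1", 5000, "INS"), ("chr2", 300, "DEL")]
    toy_calls = [("chr1", 1100, "DEL"), ("chr1", 9000, "INS"),
                 ("chr2", 310, "DEL"), ("chr2", 310, "INS")]
    print(benchmark(toy_truth, toy_calls))
```

Greedy matching keeps the example short; an optimal assignment between calls and truth records could give slightly different counts when several candidates fall within the tolerance.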
16. Mako: A Graph-based Pattern Growth Approach to Detect Complex Structural Variants
Jiadong Lin, Xiaofei Yang, Walter Kosters, Tun Xu, Yanyan Jia, Songbo Wang, Qihui Zhu, Mallory Ryan, Li Guo, Chengsheng Zhang, The Human Genome Structural Variation Consortium, Charles Lee, Scott E. Devine, Evan E. Eichler, Kai Ye
Complex structural variants (CSVs) are genomic alterations that have more than two breakpoints and are considered as the simultaneous occurrence of simple structural variants. However, detecting the compounded mutational signals of CSVs is challenging through a commonly used model-match strategy. As a result, there has been limited progress for CSV discovery compared with simple structural variants. Here, we systematically analyzed the multi-breakpoint connection feature of CSVs, and proposed Mako, utilizing a bottom-up guided model-free strategy, to detect CSVs from paired-end short-read sequencing. Specifically, we implemented a graph-based pattern growth approach, where the graph depicts potential breakpoint connections, and pattern growth enables CSV detection without pre-defined models. Comprehensive evaluations on both simulated and real datasets revealed that Mako outperformed other algorithms. Notably, validation rates of CSVs on real data based on experimental and computational validations as well as manual inspections are around 70%, where the medians of experimental and computational breakpoint shift are 13 bp and 26 bp, respectively. Moreover, the Mako CSV subgraph effectively characterized the breakpoint connections of a CSV event and uncovered a total of 15 CSV types, including two novel types of adjacent segment swap and tandem dispersed duplication. Further analysis of these CSVs also revealed the impact of sequence homology on the formation of CSVs. Mako is publicly available at https://github.com/xjtu-omics/Mako.