Review Article
Chasing Sequencing Perfection: Marching Toward Higher Accuracy and Lower Costs
Hangxing Jia, Shengjun Tan, Yong E Zhang
Next-generation sequencing (NGS), represented by Illumina platforms, has been an essential cornerstone of basic and applied research. However, the sequencing error rate of 1 per 1000 bp (10⁻³) represents a serious hurdle for research areas focusing on rare mutations, such as somatic mosaicism or microbe heterogeneity. By examining the high-fidelity sequencing methods developed in the past decade, we summarized three major factors underlying errors and the corresponding 12 strategies mitigating these errors. We then proposed a novel framework to classify 11 preexisting representative methods according to the corresponding combinatory strategies and identified three trends that emerged during methodological developments. We further extended this analysis to eight long-read sequencing methods, emphasizing error reduction strategies. Finally, we suggest two promising future directions that could achieve comparable or even higher accuracy with lower costs in both NGS and long-read sequencing.
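Through a comprehensive analysis of existing methods for reducing sequencing errors and of the progress made so far, the researchers summarize 12 strategies for improving sequencing accuracy and build a new framework for analyzing and comparing how different methods perform in lowering error rates, pointing the way for the development of new methods.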
Page qzae024
Review Article
Opportunities and Challenges in Advancing Plant Research with Single-cell Omics
Mohammad Saidur Rhaman, Muhammad Ali, Wenxiu Ye, Bosheng Li
Plants possess diverse cell types and intricate regulatory mechanisms to adapt to the ever-changing environment of nature. Various strategies have been employed to study cell types and their developmental progressions, including single-cell sequencing methods which provide high-dimensional catalogs to address biological concerns. In recent years, single-cell sequencing technologies in transcriptomics, epigenomics, proteomics, metabolomics, and spatial transcriptomics have been increasingly used in plant science to reveal intricate biological relationships at the single-cell level. However, the application of single-cell technologies to plants is more limited due to the challenges posed by cell structure. This review outlines the advancements in single-cell omics technologies, their implications in plant systems, future research applications, and the challenges of single-cell omics in plant systems.
Page qzae026
Original Research
Benchmarking Algorithms for Gene Set Scoring of Single-cell ATAC-seq Data
Xi Wang, Qiwei Lian, Haoyu Dong, Shuo Xu, Yaru Su, Xiaohui Wu
Gene set scoring (GSS) has been routinely conducted for gene expression analysis of bulk or single-cell RNA sequencing (RNA-seq) data, which helps to decipher single-cell heterogeneity and cell type-specific variability by incorporating prior knowledge from functional gene sets. Single-cell assay for transposase accessible chromatin using sequencing (scATAC-seq) is a powerful technique for interrogating single-cell chromatin-based gene regulation, and genes or gene sets with dynamic regulatory potentials can be regarded as cell type-specific markers, as in single-cell RNA-seq (scRNA-seq). However, there are few GSS tools specifically designed for scATAC-seq, and the applicability and performance of RNA-seq GSS tools on scATAC-seq data remain to be investigated. Here, we systematically benchmarked ten GSS tools, including four bulk RNA-seq tools, five scRNA-seq tools, and one scATAC-seq method. First, using matched scATAC-seq and scRNA-seq datasets, we found that the performance of GSS tools on scATAC-seq data was comparable to that on scRNA-seq, suggesting their applicability to scATAC-seq. Then, the performance of different GSS tools was extensively evaluated using up to ten scATAC-seq datasets. Moreover, we evaluated the impact of gene activity conversion, dropout imputation, and gene set collections on the results of GSS. Results show that dropout imputation can significantly improve the performance of almost all GSS tools, while the impact of gene activity conversion methods or gene set collections on GSS performance depends more on the GSS tool or dataset. Finally, we provided practical guidelines for choosing appropriate preprocessing methods and GSS tools in different application scenarios.
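To make the idea of gene set scoring concrete, the following is a minimal sketch of one simple baseline: each cell is scored by the mean z-scored activity of the genes in a set. It is not one of the ten tools benchmarked in the article, and the function name `gene_set_score`, the cell-by-gene input layout, and the z-score choice are all assumptions made for illustration.

```python
import numpy as np


def gene_set_score(activity, gene_names, gene_set):
    """Score each cell by the mean z-scored activity of genes in gene_set.

    activity:   (n_cells, n_genes) array of gene activity or expression values
    gene_names: list of n_genes gene identifiers, in column order
    gene_set:   collection of gene identifiers forming the set
    (Hypothetical interface; real GSS tools differ in inputs and statistics.)
    """
    activity = np.asarray(activity, dtype=float)
    idx = [i for i, g in enumerate(gene_names) if g in gene_set]
    if not idx:
        raise ValueError("no gene in the set is present in the matrix")
    sub = activity[:, idx]
    mean = sub.mean(axis=0)
    std = sub.std(axis=0)
    std[std == 0] = 1.0  # constant genes get a z-score of 0 instead of NaN
    z = (sub - mean) / std
    return z.mean(axis=1)  # one score per cell


if __name__ == "__main__":
    toy = [[1, 0, 5], [3, 0, 5]]  # 2 cells x 3 genes
    print(gene_set_score(toy, ["A", "B", "C"], {"A", "B"}))
```

For scATAC-seq data, the input matrix would first have to be a gene activity matrix derived from chromatin accessibility, which is the "gene activity conversion" step the abstract evaluates.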
Page qzae014
Original Research
Multi-omic Analyses Shed Light on The Genetic Control of High-altitude Adaptation in Sheep
Chao Li, Bingchun Chen, Suo Langda, Peng Pu, Xiaojia Zhu, Shiwei Zhou, Peter Kalds, Ke Zhang, Meenu Bhati, Alexander Leonard, Shuhong Huang, Ran Li, Awang Cuoji, Xiran Wang, Haolin Zhu, Yujiang Wu, Renqin Cuomu, Ba Gui, Ming Li, Yutao Wang, Yan Li, Wenwen Fang, Ting Jia, Tianchun Pu, Xiangyu Pan, Yudong Cai, Chong He, Liming Wang, Yu Jiang, Jian-Lin Han, Yulin Chen, Ping Zhou, Hubert Pausch, Xiaolong Wang
Sheep were domesticated in the Fertile Crescent and then spread globally, where they have been encountering various environmental conditions. The Tibetan sheep has adapted to high altitudes on the Qinghai-Tibet Plateau over the past 3000 years. To explore genomic variants associated with high-altitude adaptation in Tibetan sheep, we analyzed Illumina short reads of 994 whole genomes representing ∼60 sheep breeds/populations at varied altitudes, PacBio high-fidelity (HiFi) reads of 13 breeds, and 96 transcriptomes from 12 sheep organs. Association testing between the inhabited altitudes and 34,298,967 variants was conducted to investigate the genetic architecture of altitude adaptation. Highly accurate HiFi reads were used to complement the current ovine reference assembly at the most significantly associated β-globin locus and to validate the presence of two haplotypes A and B among 13 sheep breeds. Haplotype A carried two homologous gene clusters: (1) HBE1, HBE2, HBB-like, and HBBC, and (2) HBE1-like, HBE2-like, HBB-like, and HBB; haplotype B lacked the first cluster. Haplotype A was highly frequent or nearly fixed in the high-altitude sheep, while the low-altitude sheep were dominated by haplotype B. We further demonstrated that sheep with haplotype A had an increased hemoglobin–O₂ affinity compared with those carrying haplotype B. Another highly associated genomic region contained the EGLN1 gene, which showed varied expression between high-altitude and low-altitude sheep. Our results provide evidence that the rapid adaptive evolution of advantageous alleles plays an important role in facilitating the environmental adaptation of Tibetan sheep.
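Research question:
The Qinghai-Tibet Plateau is known as the "Third Pole of the Earth", and its extreme conditions of cold, hypoxia, and strong ultraviolet radiation have shaped its unique biodiversity. According to earlier archaeological and ancient DNA evidence, domestic sheep were not raised on the Qinghai-Tibet Plateau until 3000–5000 years ago. Sheep are important livestock on the plateau, so studying the molecular genetic mechanisms of their environmental adaptation can support conservation and selective breeding and is of great significance for plateau animal husbandry.
Methods:
The study combined whole-genome short-read (second-generation) sequencing data from a large sample of 994 sheep, long-read (third-generation, PacBio HiFi) sequencing data from 13 sheep breeds, and 96 transcriptomes from 12 sheep tissues sampled at high and low altitudes. It found that complex structural variation played an important role in the high-altitude adaptation of Tibetan sheep, and hemoglobin–oxygen affinity assays confirmed that haplotypes carrying different structural variants differ significantly in oxygen-carrying capacity.
Main results:
1. The sheep β-globin (HBB) locus has two classes of haplotypes (A and B). Haplotype A carries two homologous gene clusters: (i) HBE1, HBE2, HBB-like, and HBBC, and (ii) HBE1-like, HBE2-like, HBB-like, and HBB, whereas haplotype B contains only the second cluster;
2. Haplotype A at the β-globin locus is associated with high-altitude adaptation in sheep: sheep with haplotype A have a higher hemoglobin–oxygen affinity than sheep with haplotype B and adapt more readily to the plateau environment;
3. Another gene under significant selection, EGLN1, shows significantly different expression in tissues of high-altitude and low-altitude sheep, indicating that it also plays an important role in the adaptation of Tibetan sheep to high altitude.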
Page qzae030
Original Research
Genome-wide Studies Reveal Genetic Risk Factors for Hepatic Fat Content
Yanni Li, Eline H van den Berg, Alexander Kurilshikov, Dasha V Zhernakova, Ranko Gacesa, Shixian Hu, Esteban A Lopera-Maya, Alexandra Zhernakova, Lifelines Cohort Study, Vincent E de Meijer, Serena Sanna, Robin P F Dullaart, Hans Blokzijl, Eleonora A M Festen, Jingyuan Fu, Rinse K Weersma
Genetic susceptibility to metabolic associated fatty liver disease (MAFLD) is complex and poorly characterized. Accurate characterization of the genetic background of hepatic fat content would provide insights into disease etiology and causality of risk factors. We performed genome-wide association studies (GWASs) on two noninvasive definitions of hepatic fat content: magnetic resonance imaging proton density fat fraction (MRI-PDFF) in 16,050 participants and fatty liver index (FLI) in 388,701 participants from the United Kingdom (UK) Biobank (UKBB). Heritability, genetic overlap, and similarity between hepatic fat content phenotypes were analyzed, and replicated in 10,398 participants from the University Medical Center Groningen (UMCG) Genetics Lifelines Initiative (UGLI). Meta-analysis of GWASs of MRI-PDFF in UKBB revealed five statistically significant loci, including two novel genomic loci harboring CREB3L1 (rs72910057-T, P = 5.40E−09) and GCM1 (rs1491489378-T, P = 3.16E−09), respectively, as well as three previously reported loci: PNPLA3, TM6SF2, and APOE. GWAS of FLI in UKBB identified 196 genome-wide significant loci, of which 49 were replicated in UGLI, with top signals in ZPR1 (P = 3.35E−13) and FTO (P = 2.11E−09). Statistically significant genetic correlation (rg) between MRI-PDFF (UKBB) and FLI (UGLI) GWAS results was found (rg = 0.5276, P = 1.45E−03). Novel MRI-PDFF genetic signals (CREB3L1 and GCM1) were replicated in the FLI GWAS. We identified two novel genes for MRI-PDFF and 49 replicable loci for FLI. Despite a difference in hepatic fat content assessment between MRI-PDFF and FLI, a substantially similar genetic architecture was found. FLI is identified as an easy and reliable approach to study hepatic fat content at the population level.
Page qzae031
Original Research
CpG Island Definition and Methylation Mapping of the T2T-YAO Genome
Ming Xiao, Rui Wei, Jun Yu, Chujie Gao, Fengyi Yang, Le Zhang
Precisely defining and mapping all cytosine (C) positions and their clusters, known as CpG islands (CGIs), as well as their methylation status, is pivotal for genome-wide epigenetic studies, especially when population-centric reference genomes are ready for timely application. Here, we first align the two high-quality reference genomes, T2T-YAO and T2T-CHM13, from different ethnic backgrounds in a base-by-base fashion and compute their genome-wide density-defined and position-defined CGIs. Second, by mapping some representative genome-wide methylation data from selected organs onto the two genomes, we find that there is about 4.7%–5.8% sequence divergence of variable categories depending on quality cutoffs. Genes among the divergent sequences are mostly associated with neurological functions. Moreover, CGIs associated with the divergent sequences are significantly different with respect to CpG density and observed CpG/expected CpG (O/E) ratio between the two genomes. Finally, we find that the T2T-YAO genome not only has a greater CpG coverage than that of the T2T-CHM13 genome when whole-genome bisulfite sequencing (WGBS) data from the European and American populations are mapped to each reference, but also shows more hyper-methylated CpG sites as compared to the T2T-CHM13 genome. Our study suggests that future genome-wide epigenetic studies of the Chinese populations rely on both acquisition of high-quality methylation data and subsequent precision CGI mapping based on the Chinese T2T reference.
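Methods:
The two genomes were compared comprehensively through genome sequence alignment, computation of density-defined and position-defined CpG islands, and mapping of whole-genome bisulfite sequencing (WGBS) methylation data from selected representative organs. The approach covers comparison of high-quality genome sequences, prediction of CpG islands, and analysis of methylation data, in order to explore how the two genomes differ in gene sequence and epigenetic marks.
Main result 1:
The two reference genomes, T2T-YAO and T2T-CHM13, differ by about 4.7%–5.8% in gene sequence. Genes associated with these divergent sequences are mostly related to neurological functions.
Main result 2:
The CpG islands (CGIs) of the two reference genomes differ with statistical significance in CpG density and in the observed/expected CpG ratio (O/E ratio).
Main result 3:
When WGBS data from European and American populations are aligned to the two reference genomes, the T2T-YAO genome shows higher CpG site coverage and more hyper-methylated CpG sites than the T2T-CHM13 genome.
These differences highlight the importance of choosing a suitable population-centric reference genome in epigenetic studies, especially those targeting specific populations.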
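The two CGI statistics named above, CpG density and the O/E ratio, can be computed for a single sequence window as in the following minimal sketch. It uses the widely used Gardiner-Garden and Frommer form of the ratio, O/E = (number of CpG × window length) / (number of C × number of G); whether the article uses exactly this form, and how it defines density-defined versus position-defined CGIs, is not stated in the abstract, so the function name `cpg_stats` and the per-base density definition are assumptions.

```python
def cpg_stats(seq):
    """Return CpG count, CpG density, GC content, and O/E ratio for one window.

    Assumes the common definition
        O/E = (#CpG * N) / (#C * #G)
    where N is the window length. This is an illustrative helper, not the
    article's CGI-calling procedure.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {
        "cpg_count": cpg,
        "cpg_density": cpg / n,  # CpG dinucleotides per base (assumed definition)
        "gc_content": (c + g) / n,
        "obs_exp_ratio": obs_exp,
    }


if __name__ == "__main__":
    print(cpg_stats("ACGTCGCGGGCATCG"))
```

Running the same function on aligned windows from two references would give the kind of per-region density and O/E comparison the summary describes, though the actual study works genome-wide.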
Page qzae009
Original Research
Proteomic Stratification of Prognosis and Treatment Options for Small Cell Lung Cancer
Zitian Huo, Yaqi Duan, Dongdong Zhan, Xizhen Xu, Nairen Zheng, Jing Cai, Ruifang Sun, Jianping Wang, Fang Cheng, Zhan Gao, Caixia Xu, Wanlin Liu, Yuting Dong, Sailong Ma, Qian Zhang, Yiyun Zheng, Liping Lou, Dong Kuang, Qian Chu, Jun Qin, Guoping Wang, Yi Wang
Small cell lung cancer (SCLC) is a highly malignant and heterogeneous cancer with limited therapeutic options and prognosis prediction models. Here, we analyzed formalin-fixed, paraffin-embedded (FFPE) samples of surgical resections by proteomic profiling, and stratified SCLC into three proteomic subtypes (S-I, S-II, and S-III) with distinct clinical outcomes and chemotherapy responses. The proteomic subtyping was an independent prognostic factor and performed better than current tumor–node–metastasis or Veterans Administration Lung Study Group staging methods. The subtyping results could be further validated using FFPE biopsy samples from an independent cohort, extending the analysis to both surgical and biopsy samples. The signatures of the S-II subtype in particular suggested potential benefits from immunotherapy. Differentially overexpressed proteins in S-III, the worst prognostic subtype, allowed us to nominate potential therapeutic targets, indicating that patient selection may bring new hope for previously failed clinical trials. Finally, analysis of an independent cohort of SCLC patients who had received immunotherapy validated the prediction that the S-II patients had better progression-free survival and overall survival after first-line immunotherapy. Collectively, our study provides the rationale for future clinical investigations to validate the current findings for more accurate prognosis prediction and precise treatments.
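Research question
Small cell lung cancer (SCLC) is highly heterogeneous and lacks effective strategies for pathological and molecular subtyping, which makes it difficult to guide clinical treatment and predict patient prognosis. The failure of genomics and transcriptomics to assess prognosis and treatment efficacy warns that genomic and transcriptomic data alone are not enough to elucidate the molecular mechanisms or guide clinical treatment; other technologies are needed to explore the disease at different levels.
Methods
From 75 surgical SCLC samples collected at Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, this study detected 7028 quality-controlled proteins by high-performance liquid chromatography–tandem mass spectrometry (LC-MS/MS) and used unsupervised machine learning to identify three SCLC subtypes associated with survival. The subtypes were then validated at an independent center using 52 SCLC biopsy samples from the First Affiliated Hospital of Henan University. In addition, a cohort of 52 patients who received immunotherapy at Tongji Hospital was used to assess how patients of each subtype benefited from first-line and later-line immunotherapy.
Main results
1. Proteomic subtyping can be used to predict the prognosis of SCLC, and it shows good consistency and stability between large surgical resection specimens and small needle biopsy specimens.
2. SCLC can be divided into three subtypes based on proteomic features, and these subtypes differ significantly in clinical prognosis and chemotherapy response.
3. The S-II subtype responds well to immunotherapy, particularly to immune checkpoint inhibitor therapy.
4. The S-III subtype has a poor prognosis and is chemoresistant, but proteomics can identify at least one potential therapeutic target for almost every patient.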
Page qzae033
Original Research
Transcriptome Dynamics and Cell Dialogs Between Oocytes and Granulosa Cells in Mouse Follicle Development
Wenju Liu, Chuan Chen, Yawei Gao, Xinyu Cui, Yuhan Zhang, Liang Gu, Yuanlin He, Jing Li, Shaorong Gao, Rui Gao, Cizhong Jiang
The development and maturation of follicles is a sophisticated and multistage process. The dynamic gene expression of oocytes and their surrounding somatic cells and the dialogs between these cells are critical to this process. In this study, we accurately classified the oocyte and follicle development into nine stages and profiled the gene expression of mouse oocytes and their surrounding granulosa cells and cumulus cells. The clustering of the transcriptomes showed the trajectories of two distinct development courses of oocytes and their surrounding somatic cells. Gene expression changes precipitously increased at the Type 4 stage and drastically dropped afterward within both oocytes and granulosa cells. Moreover, the number of differentially expressed genes between oocytes and granulosa cells dramatically increased at the Type 4 stage, most of which persistently passed on to the later stages. Strikingly, cell communications within and between oocytes and granulosa cells became active from the Type 4 stage onward. Cell dialogs connected oocytes and granulosa cells in both unidirectional and bidirectional manners. TGFB2/3, TGFBR2/3, INHBA/B, and ACVR1/1B/2B of the TGF-β signaling pathway functioned in follicle development. The NOTCH signaling pathway regulated the development of granulosa cells. Additionally, many maternally DNA methylation- or H3K27me3-imprinted genes remained active in granulosa cells but silent in oocytes during oogenesis. Collectively, the Type 4 stage is the key turning point when significant transcription changes diverge the fate of oocytes and granulosa cells, and the cell dialogs become active to assure follicle development. These findings provide new insights into the transcriptome dynamics and cell dialogs facilitating the development and maturation of oocytes and follicles.
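Research question
Follicle development and maturation is a complex multistep process, and existing classifications of this process cannot accurately define the developmental stages of oocytes and follicles. In addition, oocytes do not develop in isolation: bidirectional communication with granulosa cells is essential for oocyte maturation and follicle growth. Resolving these two issues helps us understand the important physiological process of follicle development and maturation.
Methods
Using ultra-low-throughput RNA-seq combined with bioinformatic analysis, this study precisely divided follicle development into nine stages and revealed the close connection between oocytes and granulosa cells. Given the important role of imprinted genes in embryonic development, the study also focused on the expression patterns of imprinted genes in oocytes and their surrounding granulosa cells during oogenesis.
Main results
1. Type 4 is an important turning point in oocyte and follicle development.
2. Oocytes and their surrounding granulosa/cumulus cells show similar gene expression programs.
3. Extensive unidirectional and bidirectional cell dialogs exist between oocytes and granulosa cells, in which the TGF-β signaling pathway plays an important role.
4. Maternal genes and imprinted genes show different expression patterns in oocytes and granulosa cells.
Data link: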
https://ngdc.cncb.ac.cn/search/?dbId=gsa&q=+CRA001613
Page qzad001
Original Research
Identification of Highly Repetitive Enhancers with Long-range Regulation Potential in Barley via STARR-seq
Wanlin Zhou, Haoran Shi, Zhiqiang Wang, Yuxin Huang, Lin Ni, Xudong Chen, Yan Liu, Haojie Li, Caixia Li, Yaxi Liu
Enhancers are DNA sequences that can strengthen transcription initiation. However, the global identification of plant enhancers is complicated due to uncertainty in the distance and orientation of enhancers, especially in species with large genomes. In this study, we performed self-transcribing active regulatory region sequencing (STARR-seq) for the first time to identify enhancers across the barley genome. A total of 7323 enhancers were successfully identified, and among 45 randomly selected enhancers, over 75% were effective as validated by a dual-luciferase reporter assay system in the lower epidermis of tobacco leaves. Interestingly, up to 53.5% of the barley enhancers were repetitive sequences, especially transposable elements (TEs), thus reinforcing the vital role of repetitive enhancers in gene expression. Both the common active mark H3K4me3 and repressive mark H3K27me3 were abundant among the barley STARR-seq enhancers. In addition, the functional range of barley STARR-seq enhancers seemed much broader than that of rice or maize and extended to ±100 kb of the gene body, and this finding was consistent with the high expression levels of genes in the genome. This study specifically depicts the unique features of barley enhancers and provides available barley enhancers for further utilization.
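Research question:
Enhancers are DNA sequences in the genome that can strengthen transcription initiation, but their distance from and orientation relative to promoters are unknown, which makes them harder to discover and study. Identifying enhancers is even more difficult in barley, whose genome is large and complex. The recent development of STARR-seq offers a new approach to identifying barley enhancers.
Methods:
The authors extracted DNA from the barley cultivar Morex, sheared it randomly by sonication into 500–800 bp fragments, and cloned the fragments into the expression vector pBI221 by homologous recombination. Plasmids were prepared at large scale and transfected into Morex protoplasts, yielding 2 reporter cDNA libraries and 2 plasmid DNA input libraries. The libraries were sequenced on the Illumina NovaSeq platform with 150-bp paired-end reads, and barley enhancers were identified with bioinformatic tools including the R package BasicSTARRseq and Bonferroni correction. A tobacco dual-luciferase reporter system was used to validate the effects of some of the enhancers. Finally, RNA-seq data from 228 barley samples were collected and integrated to analyze the positions, sequence features, and regulatory characteristics of barley enhancers.
Main results:
1. Based on barley protoplast transfection, 2 cDNA reporter libraries and 2 plasmid input libraries were successfully constructed;
2. A total of 7323 barley enhancers were identified by STARR-seq;
3. Validation in the tobacco system showed that the enhancer effects are reliable;
4. Up to 53.5% of the barley enhancers consist of repetitive sequences, the vast majority of which are transposable elements;
5. Based on more than 200 barley transcriptome datasets, the functional range of barley enhancers is inferred to be ±100 kb;
6. Barley enhancers containing simple tandem repeats have the strongest regulatory effect;
7. The histone modifications H3K4me3 and H3K27me3 are both significantly enriched at enhancer sites.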
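The Bonferroni correction named in the methods is simple enough to show directly. The following is a generic sketch of the correction itself, assuming one p-value per candidate region; how the enrichment p-values are computed in the study's pipeline is not described here, so the function name `bonferroni` and the example values are purely illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjust a list of p-values.

    Each p-value is multiplied by the number of tests m (capped at 1.0);
    a test is called significant when its adjusted p-value is <= alpha,
    which is equivalent to comparing the raw p-value with alpha / m.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, significant


if __name__ == "__main__":
    # Made-up p-values for three hypothetical candidate regions.
    adj, sig = bonferroni([0.01, 0.02, 0.5])
    print(adj, sig)
```

Because the threshold shrinks in proportion to the number of tests, the correction is conservative when many genomic windows are tested at once.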
Page qzae012
Original Research
Global Marine Cold Seep Metagenomes Reveal Diversity of Taxonomy, Metabolic Function, and Natural Products
Tao Yu, Yingfeng Luo, Xinyu Tan, Dahe Zhao, Xiaochun Bi, Chenji Li, Yanning Zheng, Hua Xiang, Songnian Hu
Cold seeps in the deep sea are closely linked to energy exploration as well as global climate change. The alkane-dominated chemical energy-driven model makes cold seeps an oasis of deep-sea life, showcasing an unparalleled reservoir of microbial genetic diversity. Here, by analyzing 113 metagenomes collected from 14 global sites across 5 cold seep types, we present a comprehensive Cold Seep Microbiomic Database (CSMD) to archive the genomic and functional diversity of cold seep microbiomes. The CSMD includes over 49 million non-redundant genes and 3175 metagenome-assembled genomes, which represent 1895 species spanning 105 phyla. In addition, beta diversity analysis indicates that both the sampling site and cold seep type have a substantial impact on the prokaryotic microbiome community composition. Heterotrophic and anaerobic metabolisms are prevalent in microbial communities, accompanied by considerable mixotrophs and facultative anaerobes, highlighting the versatile metabolic potential in cold seeps. Furthermore, secondary metabolic gene cluster analysis indicates that at least 98.81% of the sequences potentially encode novel natural products, with ribosomally synthesized and post-translationally modified peptides being the predominant type widely distributed in archaea and bacteria. Overall, the CSMD represents a valuable resource that would enhance the understanding and utilization of global cold seep microbiomes.
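Research question:
Marine cold seeps are distinctive chemotrophic ecosystems that harbor diverse microbial communities. However, current knowledge of cold seep microbes remains limited, and a global survey and comparison of cold seep microbiomes is needed.
Methods:
Using metagenomic sequencing, 113 samples from 14 global cold seep sites spanning 5 cold seep types were analyzed through microbial genome assembly, binning, community diversity analysis, and functional annotation to characterize the community and functional features of cold seep microbes.
Main results:
1. A cold seep microbiome dataset (CSMD), consisting mainly of prokaryotes, was built.
2. The composition of cold seep microbial communities and how their distribution is shaped by geographic site and cold seep type were described.
3. The diversity of metabolic functions of cold seep microbes was characterized.
4. Cold seep microbes were shown to hold rich potential for synthesizing natural products.
Data link: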
https://ngdc.cncb.ac.cn/search/?dbId=gsa&q=CRA010074
Page qzad006
Method
BSAlign: A Library for Nucleotide Sequence Alignment
Haojing Shao, Jue Ruan
Increasing the accuracy of nucleotide sequence alignment is an essential issue in genomics research. Although classic dynamic programming (DP) algorithms (e.g., Smith–Waterman and Needleman–Wunsch) are guaranteed to produce the optimal result, their time complexity hinders their application to large-scale sequence alignment. Many optimization efforts that aim to accelerate the alignment process generally come from three perspectives: redesigning data structures [e.g., diagonal or striped Single Instruction Multiple Data (SIMD) implementations], increasing the number of parallelisms in SIMD operations (e.g., difference recurrence relation), or reducing search space (e.g., banded DP). However, no methods combine all three aspects to build an ultra-fast algorithm. In this study, we developed a Banded Striped Aligner (BSAlign) library that delivers accurate alignment results at an ultra-fast speed by knitting a series of novel methods together to take advantage of all three aforementioned perspectives, with highlights such as active F-loop in striped vectorization and striped move in banded DP. We applied our new acceleration design to both regular and edit distance pairwise alignment. BSAlign achieved a 2-fold speed-up over other SIMD-based implementations for regular pairwise alignment, and a 1.5-fold to 4-fold speed-up over edit distance-based implementations for long reads. BSAlign is implemented in the C programming language and is available at https://github.com/ruanjue/bsalign.
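Research question:
Improving the computational efficiency of DNA sequence alignment while preserving accuracy is an important problem in genomics research. The time complexity of classic algorithms severely hinders their application to large-scale sequence alignment. The leading approaches to parallel acceleration currently fall into three classes: the striped method, the difference method, and the banded method. However, no existing method efficiently combines all three to obtain a faster alignment algorithm.
Methods:
By efficiently combining the striped, difference, and banded methods, this study proposes the striped move method, which enables efficient striped computation within a band, and the active F-loop method, which addresses the repeated lookups over striped data that arise with long insertions or deletions. Finally, the DNA alignment software BSAlign was developed and its alignment performance was compared with that of other alignment algorithms.
Main result 1:
The project proposes the striped move and active F-loop algorithms, which greatly reduce computational complexity.
Main result 2:
BSAlign is 2-fold faster than comparable parallel algorithms, and for long sequences it is 1.5- to 4-fold faster than edit distance-based alignment algorithms.
Algorithm link:
BSAlign is freely available at: https://github.com/ruanjue/bsalign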
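Of the three acceleration ideas described above, the banded method is the easiest to illustrate in isolation. The following minimal sketch computes edit distance while filling only DP cells within `band` of the main diagonal. It is plain scalar Python and does not reproduce BSAlign's striped SIMD layout, difference recurrence, striped move, or active F-loop; the function name `banded_edit_distance` and its interface are assumptions.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to DP cells with |i - j| <= band.

    Returns None when the end cell lies outside the band. The result equals
    the true edit distance whenever that distance is <= band; otherwise it
    is an upper bound. Illustrative only, not BSAlign's algorithm.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = n + m + 1  # larger than any possible edit distance
    # Row 0: j insertions, but only for cells inside the band.
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo = max(0, i - band)
        hi = min(m, i + band)
        if lo == 0:
            cur[0] = i  # i deletions
        for j in range(max(1, lo), hi + 1):
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


if __name__ == "__main__":
    print(banded_edit_distance("kitten", "sitting", band=2))
```

Restricting the band cuts the work from roughly n × m cells to about n × (2 × band + 1), which is the search-space reduction that banded DP provides.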
Page qzae025
Method
DiffGR: Detecting Differentially Interacting Genomic Regions from Hi-C Contact Maps
Huiling Liu, Wenxiu Ma
Recent advances in high-throughput chromosome conformation capture (Hi-C) techniques have allowed us to map genome-wide chromatin interactions and uncover higher-order chromatin structures, thereby shedding light on the principles of genome architecture and functions. However, statistical methods for detecting changes in large-scale chromatin organization such as topologically associating domains (TADs) are still lacking. Here, we proposed a new statistical method, DiffGR, for detecting differentially interacting genomic regions at the TAD level between Hi-C contact maps. We utilized the stratum-adjusted correlation coefficient to measure similarity of local TAD regions. We then developed a nonparametric approach to identify statistically significant changes of genomic interacting regions. Through simulation studies, we demonstrated that DiffGR can robustly and effectively discover differential genomic regions under various conditions. Furthermore, we successfully revealed cell type-specific changes in genomic interacting regions in both human and mouse Hi-C datasets, and illustrated that DiffGR yielded consistent and advantageous results compared with state-of-the-art differential TAD detection methods. The DiffGR R package is published under the GNU General Public License (GPL) ≥ 2 license and is publicly available at https://github.com/wmalab/DiffGR.
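Research question:
Topologically associating domains (TADs) are closely linked to cell type-specific gene expression, so differential analysis of genomic regions at the TAD level is particularly important. Most existing algorithms focus on changes in TAD boundaries rather than on detecting differential TAD regions. To address this, we developed a new statistical method, DiffGR, for detecting differences in genomic regions at the TAD level between two Hi-C matrices.
Approach:
1. Determine TAD boundaries: a reliable TAD boundary detection method is applied to determine the TAD boundaries of each Hi-C matrix, and the boundaries from the two matrices are combined to define candidate genomic regions for downstream analysis.
2. Measure similarity: the stratum-adjusted correlation coefficient (SCC) is used to measure the similarity of interactions within local genomic regions.
3. Test significance: a nonparametric permutation test is applied to identify statistically significant changes in genomic interacting regions.
Main results:
1. DiffGR produces robust and stable detection results on simulated data;
2. DiffGR's detection results on real data are well validated by ChIP-seq and RNA-seq data;
3. Compared with existing tools for detecting differential TAD boundaries or regions, DiffGR produces more consistent and effective results.
Algorithm link:
https://github.com/wmalab/DiffGR (GitHub)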
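The similarity step above (step 2) can be sketched as follows. A stratum here is the set of contacts at a fixed genomic distance, that is, one diagonal of the contact matrix. The sketch computes a Pearson correlation per stratum and averages them with weights equal to the stratum size times the product of the two standard deviations. This is a simplified version: smoothing and any variance-stabilizing transformation are omitted, and the function name `stratum_adjusted_correlation` and its parameters are assumptions rather than the DiffGR package's API.

```python
import numpy as np


def stratum_adjusted_correlation(m1, m2, max_dist):
    """Simplified stratum-adjusted correlation between two contact matrices.

    m1, m2:   square contact matrices for the same candidate region
    max_dist: largest diagonal offset (in bins) to include
    Strata with fewer than 2 entries or zero variance are skipped.
    """
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    if m1.shape != m2.shape or m1.ndim != 2 or m1.shape[0] != m1.shape[1]:
        raise ValueError("inputs must be square matrices of equal shape")
    num = 0.0
    den = 0.0
    for k in range(max_dist + 1):
        x = np.diagonal(m1, offset=k)  # contacts at genomic distance k
        y = np.diagonal(m2, offset=k)
        if x.size < 2:
            continue
        sx, sy = x.std(), y.std()
        if sx == 0 or sy == 0:
            continue
        r = np.corrcoef(x, y)[0, 1]  # Pearson correlation within the stratum
        w = x.size * sx * sy         # assumed weighting for this sketch
        num += w * r
        den += w
    return num / den if den > 0 else float("nan")


if __name__ == "__main__":
    m = np.array([[1.0, 2.0, 3.0], [2.0, 5.0, 6.0], [3.0, 6.0, 9.0]])
    print(stratum_adjusted_correlation(m, m, max_dist=1))
```

In step 3, a score like this for each candidate region would be compared against a null distribution from a permutation procedure; the permutation scheme is specific to the method and is not reproduced here.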
Page qzae028
Method
TransCell: In Silico Characterization of Genomic Landscape and Cellular Responses by Deep Transfer Learning
Shan-Ju Yeh, Shreya Paithankar, Ruoqiao Chen, Jing Xing, Mengying Sun, Ke Liu, Jiayu Zhou, Bin Chen
Gene expression profiling of new or modified cell lines has become routine; however, obtaining comprehensive molecular characterization and cellular responses for a variety of cell lines, including those derived from underrepresented groups, is not trivial when resources are minimal. Using gene expression to predict other measurements has been actively explored; however, its predictive power across various measurements has not been systematically investigated. Here, we evaluated commonly used machine learning methods and presented TransCell, a two-step deep transfer learning framework that utilized the knowledge derived from pan-cancer tumor samples to predict molecular features and responses. Among these models, TransCell had the best performance in predicting metabolite, gene effect score (or genetic dependency), and drug sensitivity, and had comparable performance in predicting mutation, copy number variation, and protein expression. Notably, TransCell improved the performance by over 50% in drug sensitivity prediction and achieved a correlation of 0.7 in gene effect score prediction. Furthermore, predicted drug sensitivities revealed potential repurposing candidates for 100 new pediatric cancer cell lines, and predicted gene effect scores reflected BRAF resistance in melanoma cell lines. Together, we investigated the predictive power of gene expression in six molecular measurement types and developed a web portal (http://apps.octad.org/transcell/) that enables the prediction of 352,000 genomic and cellular response features solely from gene expression profiles.
Page qzad008
Method
APIR: Aggregating Universal Proteomics Database Search Algorithms for Peptide Identification with FDR Control
Yiling Elaine Chen, Xinzhou Ge, Kyla Woyshner, MeiLu McDermott, Antigoni Manousopoulou, Scott B Ficarro, Jarrod A Marto, Kexin Li, Leo David Wang, Jingyi Jessica Li
Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide–spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR.
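Research question:
Proteomics data analysis relies on database search algorithms to convert mass spectra into peptide sequences. Although existing database search algorithms can identify peptide–spectrum matches (PSMs), different algorithms often identify different, relatively unique sets of PSMs. How to effectively aggregate the outputs of multiple database search algorithms while controlling the false discovery rate (FDR), and thereby improve peptide identification, is therefore a pressing problem in current proteomics research.
Methods:
We propose APIR (Aggregation of Peptide Identification Results), a statistical framework for aggregating the PSM sets output by multiple database search algorithms. APIR is a sequential aggregation framework that can be applied to any database search algorithm and aims to control the FDR of the aggregated PSM set. Given the outputs of multiple database search algorithms, APIR finds disjoint PSM sets from these outputs, guarantees FDR control on each set, and finally outputs the union of these disjoint sets.
Main results
Compared with other existing aggregation methods, APIR has three advantages. First, APIR is open source and works with any database search algorithm that outputs PSM sets with matching scores (e.g., q-values or posterior error probabilities). Second, APIR is guaranteed to output at least as many peptides as any individual database search algorithm, and possibly more. Third, APIR effectively controls the FDR on both simulated and real data. APIR is therefore an effective and flexible framework that improves peptide identification from shotgun proteomics data while controlling the FDR.
Algorithm link
APIR is freely available at: https://github.com/yiling0210/APIR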
Page qzae042
Method
DVsc: An Automated Framework for Efficiently Detecting Viral Infection from Single-cell Transcriptomics Data
Fei Leng, Song Mei, Xiaolin Zhou, Xuanshi Liu, Yefeng Yuan, Wenjian Xu, Chongyi Hao, Ruolan Guo, Chanjuan Hao, Wei Li, Peng Zhang
Single-cell RNA sequencing (scRNA-seq) has emerged as a valuable tool for studying cellular heterogeneity in various fields, particularly in virological research. By studying the viral and cellular transcriptomes, the dynamics of viral infection can be investigated at a single-cell resolution. However, limited studies have been conducted to investigate whether RNA transcripts from clinical samples contain substantial amounts of viral RNAs, and a specific computational framework for efficiently detecting viral reads based on scRNA-seq data has not been developed. Hence, we introduce DVsc, an open-source framework for precise quantitative analysis of viral infection from single-cell transcriptomics data. When applied to approximately 200 diverse clinical samples that were infected by more than 10 different viruses, DVsc demonstrated high accuracy in systematically detecting viral infection across a wide array of cell types. This innovative bioinformatics pipeline could be crucial for addressing the potential effects of surreptitiously invading viruses on certain illnesses, as well as for designing novel medicines to target viruses in specific host cell subsets and evaluating the efficacy of treatment. DVsc supports the FASTQ format as an input and is compatible with multiple single-cell sequencing platforms. Moreover, it could also be applied to sequences from bulk RNA sequencing data. DVsc is available at http://62.234.32.33:5000/DVsc.
Page qzad007
Correction
Correction to: m6A Profile Dynamics Indicates Regulation of Oyster Development by m6A-RNA Epitranscriptomes
Lorane Le Franc, Bruno Petton, Pascal Favrel, Guillaume Rivière
Page qzae021
Correction
Correction to: Single-cell RNA Sequencing Reveals Sexually Dimorphic Transcriptome and Type 2 Diabetes Genes in Mouse Islet β Cells
Gang Liu, Yana Li, Tengjiao Zhang, Mushan Li, Sheng Li, Qing He, Shuxin Liu, Minglu Xu, Tinghui Xiao, Zhen Shao, Weiyang Shi, Weida Li
Page qzae022