Commentary
On the Responsible Use of Chatbots in Bioinformatics
Gangqing Hu, Li Liu, Dong Xu
Page qzae002
Review Article
Microbiome in Female Reproductive Health: Implications for Fertility and Assisted Reproductive Technologies
Liwen Xiao, Zhenqiang Zuo, Fangqing Zhao
The microbiome plays a critical role in the process of conception and the outcomes of pregnancy. Disruptions in microbiome homeostasis in women of reproductive age can lead to various pregnancy complications, which significantly impact maternal and fetal health. Recent studies have associated the microbiome in the female reproductive tract (FRT) with assisted reproductive technology (ART) outcomes, and restoring microbiome balance has been shown to improve fertility in infertile couples. This review provides an overview of the role of the microbiome in female reproductive health, including its implications for pregnancy outcomes and ARTs. Additionally, recent advances in the use of microbial biomarkers as indicators of pregnancy disorders are summarized. A comprehensive understanding of the characteristics of the microbiome before and during pregnancy and its impact on reproductive health will greatly promote maternal and fetal health. Such knowledge can also contribute to the development of ARTs and microbiome-based interventions.
Page qzad005
Review Article
RNase P: Beyond Precursor tRNA Processing
Peipei Wang, Juntao Lin, Xiangyang Zheng, Xingzhi Xu
Ribonuclease P (RNase P) was first described in the 1970s as an endoribonuclease acting in the maturation of precursor transfer RNAs (tRNAs). More recent studies, however, have uncovered non-canonical roles for RNase P and its components. Here, we review recent progress on its involvement in chromatin assembly, the DNA damage response, and the maintenance of genome stability, with implications for tumorigenesis. The possibility of RNase P as a therapeutic target in cancer is also discussed.
Page qzae016
Review Article
Substrate and Functional Diversity of Protein Lysine Post-translational Modifications
Bingbing Hao, Kaifeng Chen, Linhui Zhai, Muyin Liu, Bin Liu, Minjia Tan
Lysine post-translational modifications (PTMs) are widespread and versatile protein PTMs that are involved in diverse biological processes by regulating the fundamental functions of histone and non-histone proteins. Dysregulation of lysine PTMs is implicated in many diseases, and targeting lysine PTM regulatory factors, including writers, erasers, and readers, has become an effective strategy for disease therapy. The continuing development of mass spectrometry (MS) technologies coupled with antibody-based affinity enrichment technologies greatly promotes the discovery and decoding of PTMs. The global characterization of lysine PTMs is crucial for deciphering the regulatory networks, molecular functions, and mechanisms of action of lysine PTMs. In this review, we focus on lysine PTMs, and provide a summary of the regulatory enzymes of diverse lysine PTMs and the proteomics advances in lysine PTMs by MS technologies. We also discuss the types and biological functions of lysine PTM crosstalks on histone and non-histone proteins and current druggable targets of lysine PTM regulatory factors for disease therapy.
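Lysine post-translational modifications (lysine PTMs) are PTMs that occur widely on protein lysine residues and participate in diverse cellular processes by regulating the fundamental functions of histone and non-histone proteins. Dysregulation of lysine modifications is associated with many diseases, and targeting their regulatory factors has become an effective strategy for disease therapy. The continuing development of biological mass spectrometry and antibody-based affinity enrichment technologies has greatly promoted the discovery and functional characterization of protein PTMs. Global characterization of lysine modifications is crucial for deciphering their regulatory networks, molecular functions, and mechanisms of action. In this review, we summarize the regulatory enzymes of lysine modifications and the advances in lysine modification proteomics based on biological mass spectrometry. We also discuss the types and biological functions of crosstalk among lysine modifications on histone and non-histone proteins, as well as current druggable targets among lysine modification regulatory factors for disease therapy.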
Page qzae019
Original Research
Whole-genome Sequencing Reveals Autooctoploidy in Chinese Sturgeon and Its Evolutionary Trajectories
Binzhong Wang, Bin Wu, Xueqing Liu, Yacheng Hu, Yao Ming, Mingzhou Bai, Juanjuan Liu, Kan Xiao, Qingkai Zeng, Jing Yang, Hongqi Wang, Baifu Guo, Chun Tan, Zixuan Hu, Xun Zhao, Yanhong Li, Zhen Yue, Junpu Mei, Wei Jiang, Yuanjin Yang, Zhiyuan Li, Yong Gao, Lei Chen, Jianbo Jian, Hejun Du
The order Acipenseriformes, which includes sturgeons and paddlefishes, represents “living fossils” with complex genomes that are good models for understanding whole-genome duplication (WGD) and ploidy evolution in fishes. Here, we sequenced and assembled the first high-quality chromosome-level genome for the complex octoploid Acipenser sinensis (Chinese sturgeon), a critically endangered species that also represents a poorly understood ploidy group in Acipenseriformes. Our results show that A. sinensis is a complex autooctoploid species containing four kinds of octovalents (8n), a hexavalent (6n), two tetravalents (4n), and a divalent (2n). An analysis taking into account delayed rediploidization reveals that the octoploid genome composition of Chinese sturgeon results from two rounds of homologous WGDs, and further provides insights into the timing of its ploidy evolution. This study provides the first octoploid genome resource of Acipenseriformes for understanding ploidy compositions and evolutionary trajectories of polyploid fishes.
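Research question
How can a high-quality haplotype reference genome sequence be constructed for the Chinese sturgeon? Based on this reference genome, the study addresses the century-long debate over ploidy evolution in extant Acipenseriformes: whether the group represented by the Chinese sturgeon is tetraploid or octoploid, whether polyploidy in Acipenseriformes arose from autopolyploid or allopolyploid whole-genome duplication (WGD), the temporal order of WGD and species divergence in Acipenseriformes, and the timing of divergence within the order.
Research methods
Using a gynogenetic Chinese sturgeon individual produced by artificial induction, multiple genome assembly methods were applied to PacBio sequencing data, and the resulting high-quality assemblies were merged. Second-generation Illumina sequencing data were used for error correction, and Hi-C was used to anchor the genome to chromosomes. Normally developed Chinese sturgeon individuals were resequenced, and k-mer, SSR, SNP, and TE analyses were performed to determine ploidy and homology. Genes showing lineage-specific ohnologue resolution (LORe) and ancestral ohnologue resolution (AORe) were separated, and the divergence of AORe genes was used to infer the speciation and WGD history of sturgeons (Acipenseridae) and paddlefishes (Polyodontidae).
Main results
1. The whole genome of the Chinese sturgeon was assembled and characterized for the first time. Sixty-six reference chromosome sequences (2 haplotypes, 2 monoploids) were assembled, with a BUSCO completeness of 95.6%, and gene functions were annotated. This is also the first octoploid animal genome resolved to date, providing important reference data for conservation research on the Chinese sturgeon and other endangered sturgeons and for the breeding of commercially farmed sturgeons.
2. The ploidy composition of the Chinese sturgeon was resolved, showing that it is an octoploid species with a complex ploidy composition. Delayed rediploidization and two rounds of homologous WGD are inferred to have produced the current complex ploidy state. These findings provide important reference data for research on polyploid species and vertebrate evolution.
3. The common ancestor of sturgeons and paddlefishes is inferred to have undergone the first, Acipenseriformes-specific WGD (As3R) about 210 million years ago; the two groups diverged about 150 million years ago; and the ancestor of the Chinese sturgeon underwent a second, lineage-specific WGD about 35 million years ago. These results provide an important reference for the evolution of Acipenseriformes fishes.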
Page qzad002
Original Research
Integrated Single-cell Multiomic Analysis of HIV Latency Reversal Reveals Novel Regulators of Viral Reactivation
Manickam Ashokkumar, Wenwen Mei, Jackson J Peterson, Yuriko Harigaya, David M Murdoch, David M Margolis, Caleb Kornfein, Alex Oesterling, Zhicheng Guo, Cynthia D Rudin, Yuchao Jiang, Edward P Browne
Despite the success of antiretroviral therapy, human immunodeficiency virus (HIV) cannot be cured because of a reservoir of latently infected cells that evades therapy. To understand the mechanisms of HIV latency, we employed an integrated single-cell RNA sequencing (scRNA-seq) and single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq) approach to simultaneously profile the transcriptomic and epigenomic characteristics of ∼ 125,000 latently infected primary CD4+ T cells after reactivation using three different latency reversing agents. Differentially expressed genes and differentially accessible motifs were used to examine transcriptional pathways and transcription factor (TF) activities across the cell population. We identified cellular transcripts and TFs whose expression/activity was correlated with viral reactivation and demonstrated that a machine learning model trained on these data was 75%–79% accurate at predicting viral reactivation. Finally, we validated the role of two candidate HIV-regulating factors, FOXP1 and GATA3, in viral transcription. These data demonstrate the power of integrated multimodal single-cell analysis to uncover novel relationships between host cell factors and HIV latency.
Page qzae003
Original Research
Molecular Evolution of Protein Sequences and Codon Usage in Monkeypox Viruses
Ke-Jia Shan, Changcheng Wu, Xiaolu Tang, Roujian Lu, Yaling Hu, Wenjie Tan, Jian Lu
The monkeypox virus (mpox virus, MPXV) epidemic in 2022 has posed a significant public health risk. Yet, the evolutionary principles of MPXV remain largely unknown. Here, we examined the evolutionary patterns of protein sequences and codon usage in MPXV. We first demonstrated the signal of positive selection in OPG027, specifically in the Clade I lineage of MPXV. Subsequently, we discovered accelerated protein sequence evolution over time in the variants responsible for the 2022 outbreak. Furthermore, we showed strong epistasis between amino acid substitutions located in different genes. The codon adaptation index (CAI) analysis revealed that MPXV genes tended to use more non-preferred codons compared to human genes, and the CAI decreased over time and diverged between clades, with Clade I > IIa and IIb-A > IIb-B. While the decrease in fatality rate among the three groups aligned with the CAI pattern, it remains unclear whether this correlation was coincidental or if the deoptimization of codon usage in MPXV led to a reduction in fatality rates. This study sheds new light on the mechanisms that govern the evolution of MPXV in human populations.
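This study reveals the molecular evolutionary patterns of protein sequences and codon usage in the monkeypox virus (mpox virus, MPXV) during its spread in human populations. First, the study found that OPG027 (VACV-Cop C7L, an interferon-inhibiting gene) may have been under positive selection during MPXV evolution. Second, it found that amino acid changes in different genes of the MPXV variants responsible for the 2022 mpox outbreak may be epistatic. Finally, it found that the decline in fatality rate across MPXV variants of different clades may be associated with the deoptimization of codon usage.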
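The codon adaptation index (CAI) mentioned in this abstract is conventionally computed as the geometric mean of per-codon relative adaptiveness values, where each codon is scored against the most-used synonymous codon in a reference set. The following is a minimal sketch of that general definition, not the authors' pipeline; the reference counts and the two test sequences are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical reference codon counts grouped by amino acid.
# These are illustrative numbers, NOT real human codon usage data.
REFERENCE_COUNTS = {
    "K": {"AAA": 20, "AAG": 80},
    "E": {"GAA": 40, "GAG": 60},
    "F": {"TTT": 50, "TTC": 50},
}


def relative_adaptiveness(reference_counts):
    """Map each codon to w = count / (max count among its synonymous codons)."""
    weights = {}
    for codon_counts in reference_counts.values():
        best = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / best
    return weights


def cai(sequence, weights):
    """Geometric mean of w over the codons of `sequence` that have a weight."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS)
preferred = cai("AAGGAGTTC", WEIGHTS)      # uses only the most frequent codons
non_preferred = cai("AAAGAATTT", WEIGHTS)  # uses less frequent synonymous codons
print(preferred, non_preferred)
```

Under this definition, a gene that relies on non-preferred codons scores below one that uses the preferred codons of the reference set, which is the sense in which the abstract describes a decreasing CAI as deoptimization.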
Page qzad003
Method
Pindel-TD: A Tandem Duplication Detector Based on A Pattern Growth Approach
Xiaofei Yang, Gaoyang Zheng, Peng Jia, Songbo Wang, Kai Ye
Tandem duplication (TD) is a major type of structural variation (SV) that plays an important role in novel gene formation and human diseases. However, TDs are often missed or incorrectly classified as insertions by most modern SV detection methods because these methods lack specialized handling of TD-related mutational signals. Herein, we developed a TD detection module for the Pindel tool, referred to as Pindel-TD, based on a TD-specific pattern growth approach. Pindel-TD is capable of detecting TDs across a wide size range at single-nucleotide resolution. Using simulated and real read data from HG002, we demonstrated that Pindel-TD outperforms other leading methods in terms of precision, recall, F1-score, and robustness. Furthermore, by applying Pindel-TD to data generated from the K562 cancer cell line, we identified a TD located at the seventh exon of SAGE1, providing an explanation for its high expression. Pindel-TD is available for non-commercial use at https://github.com/xjtu-omics/pindel.
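Research question:
Tandem duplication is a class of structural variation that plays important roles in novel gene formation and in the onset and progression of human diseases. However, current structural variation detection methods do not model the variant signals of tandem duplications separately, so they often miss tandem duplications or misclassify them as insertions. The main question addressed in this study is how to design an accurate and complete algorithm for detecting genomic tandem duplications.
Research approach:
• Proposed Pindel-TD, an algorithm for detecting genomic tandem duplications
• Built different read alignment models for tandem duplications of different sizes
• Designed corresponding pattern growth algorithms for the different alignment signals, achieving detection of tandem duplications across the full size range at single-nucleotide resolution
Main result 1:
Evaluation on simulated sequencing data showed that Pindel-TD outperforms other algorithms in precision, recall, F1-score, and robustness.
Main result 2:
Evaluation on real HG002 sequencing data confirmed that Pindel-TD outperforms other algorithms.
Main result 3:
Applying Pindel-TD to sequencing data from the K562 cancer cell line revealed a tandem duplication in the seventh exon of SAGE1, a gene encoding a cancer antigen, which potentially explains the high expression of SAGE1 in cancer.
Algorithm link:
https://github.com/xjtu-omics/pindel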
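The precision, recall, and F1-score used to benchmark Pindel-TD follow the standard definitions for comparing a call set against a truth set. The sketch below shows those definitions only; it is not the paper's benchmarking procedure. The matching rule (exact agreement of chromosome, start, and end) and all coordinates are assumptions made for illustration.

```python
def evaluate_calls(called, truth):
    """Return (precision, recall, F1) for a call set against a truth set.

    A call counts as a true positive only if its (chrom, start, end) tuple
    matches a truth entry exactly; this matching rule is an assumption.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical tandem duplication intervals, for illustration only.
truth_set = {("chr1", 100, 250), ("chr1", 900, 1400),
             ("chr2", 50, 80), ("chr3", 10, 500)}
call_set = {("chr1", 100, 250), ("chr2", 50, 80), ("chr2", 300, 420)}

precision, recall, f1 = evaluate_calls(call_set, truth_set)
print(precision, recall, f1)
```

Here two of three calls are correct and two of four true events are found, so the F1-score, the harmonic mean of the two, sits between them. Real SV benchmarks often allow some breakpoint tolerance instead of exact matching.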
Page qzae008
Method
NextPolish2: A Repeat-aware Polishing Tool for Genomes Assembled Using HiFi Long Reads
Jiang Hu, Zhuo Wang, Fan Liang, Shan-Lin Liu, Kai Ye, De-Peng Wang
The high-fidelity (HiFi) long-read sequencing technology developed by PacBio has greatly improved the base-level accuracy of genome assemblies. However, these assemblies still contain base-level errors, particularly within the error-prone regions of HiFi long reads. Existing genome polishing tools usually introduce overcorrections and haplotype switch errors when correcting errors in genomes assembled from HiFi long reads. Here, we describe an upgraded genome polishing tool, NextPolish2, which can fix base errors remaining in those “highly accurate” genomes assembled from HiFi long reads without introducing excessive overcorrections and haplotype switch errors. We believe that NextPolish2 is of great significance for further improving the accuracy of telomere-to-telomere (T2T) genomes. NextPolish2 is freely available at https://github.com/Nextomics/NextPolish2.
Page qzad009
Method
FP-Zernike: An Open-source Structural Database Construction Toolkit for Fast Structure Retrieval
Junhai Qi, Chenjie Feng, Yulin Shi, Jianyi Yang, Fa Zhang, Guojun Li, Renmin Han
The release of AlphaFold2 has sparked a rapid expansion in protein model databases. Efficient protein structure retrieval is crucial for the analysis of structure models, while measuring the similarity between structures is the key challenge in structural retrieval. Although existing structure alignment algorithms can address this challenge, they are often time-consuming. Currently, the state-of-the-art approach involves converting protein structures into three-dimensional (3D) Zernike descriptors and assessing similarity using Euclidean distance. However, the methods for computing 3D Zernike descriptors mainly rely on structural surfaces and are predominantly web-based, thus limiting their application in studying custom datasets. To overcome this limitation, we developed FP-Zernike, a user-friendly toolkit for computing different types of Zernike descriptors based on feature points. Users simply need to enter a single line of command to calculate the Zernike descriptors of all structures in customized datasets. FP-Zernike outperforms the leading method in terms of retrieval accuracy and binary classification accuracy across diverse benchmark datasets. In addition, we showed the application of FP-Zernike in the construction of the descriptor database and the protocol used for the Protein Data Bank (PDB) dataset to facilitate the local deployment of this tool for interested readers. Our demonstration contained 590,685 structures, and at this scale, our system required only 4–9 s to complete a retrieval. The experiments confirmed that it achieved the state-of-the-art accuracy level. FP-Zernike is an open-source toolkit, with the source code and related data accessible at https://ngdc.cncb.ac.cn/biocode/tools/BT007365/releases/0.1, as well as through a webserver at http://www.structbioinfo.cn/.
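Research questions:
How can the similarity between protein (or RNA) structures be measured quickly?
How can structures similar to a query structure be found quickly in a structure database?
How can the tool be made convenient for biologists, so that they can pursue further research?
Research approach:
Proposed FP-Zernike, a structure comparison algorithm based on Zernike moments;
Converted structures into feature vectors and built a feature vector database, turning the protein structure query problem into a Euclidean distance computation between feature vectors;
Provided the source code and also built a web service with visualization.
Main result 1:
The performance of FP-Zernike was validated on a variety of structure datasets, including protein structure datasets and RNA structure datasets.
Main result 2:
A feature vector database was built for current protein structure databases and used to construct a protein structure query system (web service).
Main result 3:
The structure query website was compared with current state-of-the-art retrieval systems to verify its practicality.
Algorithm links:
https://github.com/junhaiqi/FP-Zernike
https://ngdc.cncb.ac.cn/biocode/tools/BT007365/releases/0.1
Website link:
http://www.structbioinfo.cn/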
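The retrieval idea summarized for FP-Zernike, converting each structure into a descriptor vector and ranking database entries by Euclidean distance to the query, can be sketched as follows. This is not FP-Zernike's code: the function name `retrieve`, the three-component vectors, and the structure names are hypothetical, and real Zernike descriptors have many more components.

```python
import math


def retrieve(query, database, k=2):
    """Return the k database entries closest to `query` by Euclidean distance."""
    ranked = sorted(database.items(),
                    key=lambda item: math.dist(query, item[1]))
    return [(name, math.dist(query, vector)) for name, vector in ranked[:k]]


# Hypothetical descriptor vectors standing in for precomputed descriptors.
DESCRIPTORS = {
    "structure_A": (0.10, 0.80, 0.30),
    "structure_B": (0.90, 0.10, 0.50),
    "structure_C": (0.12, 0.79, 0.33),
}

hits = retrieve((0.11, 0.80, 0.31), DESCRIPTORS, k=2)
for name, distance in hits:
    print(name, round(distance, 4))
```

Because the descriptors are computed once and stored, a query reduces to distance calculations over fixed-length vectors, which is why this style of search can be much faster than pairwise structure alignment. A linear scan is used here for clarity; a large deployment would more likely use a vectorized or indexed search.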
Page qzae007
Database
GametesOmics: A Comprehensive Multi-omics Database for Exploring the Gametogenesis in Humans and Mice
Jianting An, Jing Wang, Siming Kong, Shi Song, Wei Chen, Peng Yuan, Qilong He, Yidong Chen, Ye Li, Yi Yang, Wei Wang, Rong Li, Liying Yan, Zhiqiang Yan, Jie Qiao
Gametogenesis plays an important role in the reproduction and evolution of species. The transcriptomic and epigenetic alterations in this process can influence reproductive capacity, fertilization, and embryonic development. The rapidly increasing number of single-cell studies has provided valuable multi-omics resources. However, data from different layers and sequencing platforms have not been unified and integrated, which greatly limits their use for exploring the molecular mechanisms that underlie oogenesis and spermatogenesis. Here, we develop GametesOmics, a comprehensive database that integrates the data of gene expression, DNA methylation, and chromatin accessibility during oogenesis and spermatogenesis in humans and mice. GametesOmics provides a user-friendly website and various tools, including Search and Advanced Search for querying the expression and epigenetic modification(s) of each gene; Tools with Differentially expressed gene (DEG) analysis for identifying DEGs, Correlation analysis for demonstrating the genetic and epigenetic changes, Visualization for displaying single-cell clusters and screening marker genes as well as master transcription factors (TFs), and MethylView for studying the genomic distribution of epigenetic modifications. GametesOmics also provides Genome Browser and Ortholog for tracking and comparing gene expression, DNA methylation, and chromatin accessibility between humans and mice. GametesOmics offers a comprehensive resource for biologists and clinicians to decipher the cell fate transition in germ cell development, and can be accessed at http://gametesomics.cn/.
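Research question
In recent years, single-cell sequencing technologies have developed rapidly and generated large amounts of omics data, providing highly valuable resources for exploring the regulatory mechanisms of gametogenesis. However, multi-omics sequencing data on gamete development from different studies have not yet been systematically analyzed and integrated, which greatly limits the use and in-depth mining of these data. A multi-omics database is therefore urgently needed to help researchers access and mine this multi-omics information.
Research methods
Through a literature search, we obtained single-cell sequencing datasets covering human and mouse gametogenesis. With data quality and comparability in mind, we selected, wherever possible, the datasets that cover the most complete set of developmental stages, contain the most samples, and use the most advanced sequencing methods, and we processed and integrated them with a unified analysis pipeline.
Main results
We built GametesOmics, a multi-omics database of human and mouse germ cell development. Using a unified, standardized bioinformatics pipeline, we comprehensively integrated multi-omics information, including the transcriptome, DNA methylation, and chromatin accessibility, covering all stages of sperm and oocyte development. GametesOmics also provides a variety of tools that help researchers use and mine the multi-omics data, explore key factors in human and mouse gamete development, and investigate the regulatory mechanisms of cell fate determination during gametogenesis.
Introduction
Gametogenesis is a key process in the reproduction and evolution of species and involves complex molecular regulatory mechanisms, including genetic and epigenetic changes. The rapid development of single-cell sequencing has advanced our study and understanding of gametogenesis and produced large amounts of multi-omics data. However, data generated by different sequencing platforms and methods are difficult to use and analyze together, which greatly hinders the exploration of the molecular mechanisms of oogenesis and spermatogenesis, and a multi-omics database covering all developmental stages of germ cells is urgently needed. We therefore developed GametesOmics (http://gametesomics.cn/), a multi-omics database for human and mouse oogenesis and spermatogenesis.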
Page qzad004
Database
HCCDB v2.0: Decompose Expression Variations by Single-cell RNA-seq and Spatial Transcriptomics in HCC
Ziming Jiang, Yanhong Wu, Yuxin Miao, Kaige Deng, Fan Yang, Shuhuan Xu, Yupeng Wang, Renke You, Lei Zhang, Yuhan Fan, Wenbo Guo, Qiuyu Lian, Lei Chen, Xuegong Zhang, Yongchang Zheng, Jin Gu
Large-scale transcriptomic data are crucial for understanding the molecular features of hepatocellular carcinoma (HCC). By integrating 15 transcriptomic datasets of HCC clinical samples, the first version of the HCC database (HCCDB v1.0) was released in 2018. Through the meta-analysis of differentially expressed genes and prognosis-related genes across multiple datasets, it provides a systematic view of the altered biological processes and the inter-patient heterogeneities of HCC with high reproducibility and robustness. Four years on, the database needs to integrate recently published datasets. Furthermore, the latest single-cell and spatial transcriptomics have provided a great opportunity to decipher complex gene expression variations at the cellular level with spatial architecture. Here, we present HCCDB v2.0, an updated version that combines bulk, single-cell, and spatial transcriptomic data of HCC clinical samples. It dramatically expands the bulk sample size by adding 1656 new samples from 11 datasets to the existing 3917 samples, thereby enhancing the reliability of transcriptomic meta-analysis. A total of 182,832 cells and 69,352 spatial spots are added to the single-cell and spatial transcriptomics sections, respectively. A novel single-cell level and 2-dimension (sc-2D) metric is also proposed to summarize cell type-specific and dysregulated gene expression patterns. All results are graphically visualized in our online portal, allowing users to easily retrieve data through a user-friendly interface and navigate between different views. With extensive clinical phenotypes and transcriptomic data in the database, we show two applications for identifying prognosis-associated cells and tumor microenvironment. HCCDB v2.0 is available at http://lifeome.net/database/hccdb2.
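Research question:
Hepatocellular carcinoma (HCC) is the most common form of primary liver cancer, accounting for 75% to 85% of cases. In previous work, we built HCCDB, a large transcriptomic database, and used meta-analysis to improve the robustness of the results; the database has been widely recognized. However, as new transcriptomic data continue to emerge, updating the data has become a pressing challenge. In addition, the latest single-cell sequencing and spatial transcriptomics technologies open new ground for interpreting the transcriptomic patterns and heterogeneity of HCC in detail at the single-cell level, while also bringing unprecedented challenges.
Research methods:
In this study, we updated HCCDB and released a more comprehensive version, HCCDB v2.0. The upgrade integrates 1656 samples from 11 new datasets, along with 182,832 single cells and 69,352 spatial spots. We used meta-analysis to analyze differentially expressed genes and prognosis-related genes, and examined in depth the heterogeneity of gene expression in HCC at the single-cell and spatial levels. We also introduced a novel single-cell level and 2-dimension (sc-2D) metric to describe specific cell types and gene expression patterns more precisely.
Main results:
1. The upgrade to HCCDB v2.0 substantially increases the sample size, further strengthening the reliability and representativeness of the transcriptomic meta-analysis.
2. The newly added single-cell and spatial transcriptomic data provide a panoramic platform for HCC research, supporting analysis of gene expression patterns from bulk transcriptomes down to single-cell resolution.
3. By identifying cell subpopulations and tumor microenvironment types associated with poor prognosis, the database provides a valuable resource for clinical research on HCC.
4. The redesigned web portal has a user-friendly interface that simplifies data retrieval and lets users switch easily between views, markedly improving the usability and interactivity of the database.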
Page qzae011
Database
MARS and RNAcmap3: The Master Database of All Possible RNA Sequences Integrated with RNAcmap for RNA Homology Search
Ke Chen, Thomas Litfin, Jaswinder Singh, Jian Zhan, Yaoqi Zhou
The recent success of AlphaFold2 in protein structure prediction relied heavily on co-evolutionary information derived from homologous protein sequences found in the huge, integrated database of protein sequences (Big Fantastic Database). In contrast, the existing nucleotide databases were not consolidated to facilitate wider and deeper homology search. Here, we built a comprehensive database by incorporating the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome assembly and metagenome assembly from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide (nt) database and its subsets in the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI’s nt database or 60-fold larger than RNAcentral. The new dataset, along with a new split–search strategy, allows a substantial improvement in homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS coupled with the fully automatic homology search tool RNAcmap will be useful for improved structural and functional inference of ncRNAs and RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037, and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.
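Research question:
The success of AlphaFold2 in protein structure prediction relied to a considerable extent on co-evolutionary information derived from homologous protein sequences, which were extracted from the Big Fantastic Database (BFD), a huge integrated database of protein sequences. In contrast, the existing nucleotide databases lack a comprehensive sequence database consolidated for wider and deeper homology search.
Research methods:
Building on the NCBI nt database and its subsets, we integrated non-coding RNA sequences from RNAcentral, metatranscriptome and metagenome sequences from MG-RAST, and genomic sequences from GWH and MGnify to construct MARS, a comprehensive master database of RNA sequences, and improved the RNAcmap homology search pipeline for very large databases.
Main results:
The MARS database built in this work is about 20-fold larger than the NCBI nt database, or 60-fold larger than RNAcentral. The new generation of RNAcmap (RNAcmap3), based on a split–search strategy, achieves homology search performance on MARS that surpasses current state-of-the-art techniques. For structured RNAs that can be mapped to Rfam, this new toolset yields more accurate and more sensitive multiple sequence alignments than manually curated ones for the majority of target sequences.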
Page qzae018
Database
Q-BioLiP: A Comprehensive Resource for Quaternary Structure-based Protein–ligand Interactions
Hong Wei, Wenkai Wang, Zhenling Peng, Jianyi Yang
Since its establishment in 2013, BioLiP has become one of the most widely used resources for protein–ligand interactions. Nevertheless, several issues with it have become apparent over the past decade. For example, the protein–ligand interactions are represented in the form of single chain-based tertiary structures, which may be inappropriate as many interactions involve multiple protein chains (known as quaternary structures). We sought to address these issues, resulting in Q-BioLiP, a comprehensive resource for quaternary structure-based protein–ligand interactions. The major features of Q-BioLiP include: (1) representing protein structures in the form of quaternary structures rather than single chain-based tertiary structures; (2) pairing DNA/RNA chains properly rather than separating them; (3) providing both experimental and predicted binding affinities; (4) retaining both biologically relevant and irrelevant interactions to reduce misjudgment of ligands’ biological relevance; and (5) developing a new quaternary structure-based algorithm for the modelling of protein–ligand complex structure. With these new features, Q-BioLiP is expected to be a valuable resource for studying biomolecule interactions, including protein–small molecule interaction, protein–metal ion interaction, protein–peptide interaction, protein–protein interaction, protein–DNA/RNA interaction, and RNA–small molecule interaction. Q-BioLiP is freely available at https://yanglab.qd.sdu.edu.cn/Q-BioLiP/.
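Research question:
Many proteins carry out their biological functions by interacting with other biomolecules, which are called ligands. Since its publication in 2013, BioLiP has become one of the most popular databases for studying protein–ligand interactions. However, proteins in BioLiP are represented as single chain-based tertiary structures, so when a ligand interacts with multiple chains of a protein at once, the protein–ligand interaction cannot be presented in full. A comprehensive database based on complete quaternary structures is therefore needed to better study protein–ligand interactions.
Research methods:
Quaternary structures were generated from the rotation matrices and translation vectors stored in all asymmetric unit entries (mmCIF format) in the PDB, and the receptor and ligand structures were extracted. For DNA/RNA, a pairing algorithm was proposed to pair multiple nucleic acid chains. A semi-automated procedure was used to assess the biological relevance of each protein–ligand interaction record.
Main results:
We built Q-BioLiP, a protein–ligand interaction database based on complete quaternary structures. The main improvements are: representing protein structures as quaternary rather than tertiary structures; retaining both biologically relevant and irrelevant interactions to reduce misjudgment of biological relevance; pairing DNA/RNA chains; providing datasets for nearly all interaction types, including protein–small molecule, protein–metal ion, protein–peptide, protein–protein, protein–DNA/RNA, and RNA–small molecule; using the mmCIF format instead of the PDB format, so that large structures can be handled; providing predicted affinity data; and proposing a protein–ligand binding site prediction algorithm that supports both tertiary and quaternary structures.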
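Generating a quaternary structure from an asymmetric unit, as described in the methods summary for Q-BioLiP, amounts to applying each stored rotation matrix R and translation vector t to the atomic coordinates (x' = Rx + t) and collecting the copies. The sketch below illustrates that transformation only; it is not Q-BioLiP's code. The function names, coordinates, and the two-fold operation are hypothetical, and no mmCIF parsing is attempted (in mmCIF files such operators are typically listed in the `pdbx_struct_oper_list` category).

```python
def apply_operation(coords, rotation, translation):
    """Return coords transformed as x' = R x + t (3x3 rotation, 3-vector t)."""
    transformed = []
    for point in coords:
        transformed.append(tuple(
            sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return transformed


def build_assembly(asymmetric_unit, operations):
    """Concatenate one transformed copy of the asymmetric unit per operation."""
    assembly = []
    for rotation, translation in operations:
        assembly.extend(apply_operation(asymmetric_unit, rotation, translation))
    return assembly


# Hypothetical operators: identity, plus a two-fold rotation about z with a shift.
IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
TWOFOLD_Z = ((-1, 0, 0), (0, -1, 0), (0, 0, 1))

chain_a = [(1.0, 2.0, 3.0), (2.0, 2.0, 3.5)]  # made-up atom coordinates
dimer = build_assembly(chain_a, [
    (IDENTITY, (0.0, 0.0, 0.0)),
    (TWOFOLD_Z, (10.0, 0.0, 0.0)),
])
print(dimer)
```

With both copies present, a ligand that contacts residues from the original chain and its symmetry mate can be described in one coordinate frame, which is the situation that a single chain-based representation cannot capture.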
Page qzae001
Database
KoNA: Korean Nucleotide Archive as A New Data Repository for Nucleotide Sequence Data
Gunhwan Ko, Jae Ho Lee, Young Mi Sim, Wangho Song, Byung-Ha Yoon, Iksu Byeon, Bang Hyuck Lee, Sang-Ok Kim, Jinhyuk Choi, Insoo Jang, Hyerin Kim, Jin Ok Yang, Kiwon Jang, Sora Kim, Jong-Hwan Kim, Jongbum Jeon, Jaeeun Jung, Seungwoo Hwang, Ji-Hwan Park, Pan-Gyu Kim, Seon-Young Kim, Byungwook Lee
During the last decade, the generation and accumulation of petabase-scale high-throughput sequencing data have resulted in great challenges, including access to human data, as well as transfer, storage, and sharing of enormous amounts of data. To promote data-driven biological research, the Korean government announced that all biological data generated from government-funded research projects should be deposited at the Korea BioData Station (K-BDS), which consists of multiple databases for individual data types. Here, we introduce the Korean Nucleotide Archive (KoNA), a repository of nucleotide sequence data. As of July 2022, the Korean Read Archive in KoNA has collected over 477 TB of raw next-generation sequencing data from national genome projects. To ensure data quality and prepare for international alignment, a standard operating procedure was adopted, which is similar to that of the International Nucleotide Sequence Database Collaboration. The standard operating procedure includes quality control processes for submitted data and metadata using an automated pipeline, followed by manual examination. To ensure fast and stable data transfer, a high-speed transmission system called GBox is used in KoNA. Furthermore, the data uploaded to or downloaded from KoNA through GBox can be readily processed using a cloud computing service called Bio-Express. This seamless coupling of KoNA, GBox, and Bio-Express enhances the data experience, including submission, access, and analysis of raw nucleotide sequences. KoNA not only satisfies the unmet needs for a national sequence repository in Korea but also provides datasets to researchers globally and contributes to advances in genomics. The KoNA is available at https://www.kobic.re.kr/kona/.
Page qzae017
Letter
A Two-color Single-molecule Sequencing Platform and Its Clinical Applications
Fang Chen, Bin Liu, Meirong Chen, Zefei Jiang, Zhiliang Zhou, Ping Wu, Meng Zhang, Huan Jin, Linsen Li, Liuyan Lu, Huan Shang, Lei Liu, Weiyue Chen, Jianfeng Xu, Ruitao Sun, Guangming Wang, Jiao Zheng, Jifang Qi, Bo Yang, Lidong Zeng, Yan Li, Hui Lv, Nannan Zhao, Wen Wang, Jinsen Cai, Yongfeng Liu, Weiwei Luo, Juan Zhang, Yanhua Zhang, Jicai Fan, Haitao Dan, Xuesen He, Wei Huang, Lei Sun
DNA sequencers have become increasingly important research and diagnostic tools over the past 20 years. In this study, we developed a single-molecule desktop sequencer, GenoCare 1600 (GenoCare), which utilizes amplification-free library preparation and two-color sequencing-by-synthesis chemistry, making it more user-friendly compared with previous single-molecule sequencing platforms for clinical use. Using the GenoCare platform, we sequenced an Escherichia coli standard sample and achieved a consensus accuracy exceeding 99.99%. We also evaluated the sequencing performance of this platform in microbial mixtures and coronavirus disease 2019 (COVID-19) samples from throat swabs. Our findings indicate that the GenoCare platform allows for microbial quantitation, sensitive identification of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, and accurate detection of virus mutations, as confirmed by Sanger sequencing, demonstrating its remarkable potential in clinical application.
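Over the past 20 years, DNA sequencers have become increasingly important research and diagnostic tools. We developed a single-molecule desktop sequencer, GenoCare 1600 (GenoCare), which uses amplification-free library preparation and two-color sequencing-by-synthesis chemistry. Compared with previously released single-molecule sequencing platforms, this instrument is more user-friendly for clinical users. In this study, we report that GenoCare sequencing of an Escherichia coli standard sample yielded a consensus sequence accuracy of 99.99%. We also evaluated the performance of GenoCare on microbial mixtures and on throat swab samples from coronavirus disease 2019 (COVID-19) patients. The results show that the GenoCare platform can quantify microbes, sensitively identify SARS-CoV-2, and accurately detect viral mutation sites (as verified by Sanger sequencing). These results demonstrate the excellent potential of the GenoCare sequencer for clinical application.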
Page qzae006
Correction
Correction to: dbDEMC 3.0: Functional Exploration of Differentially Expressed miRNAs in Cancers of Human and Model Organisms
Feng Xu, Yifan Wang, Yunchao Ling, Chenfen Zhou, Haizhou Wang, Andrew E. Teschendorff, Yi Zhao, Haitao Zhao, Yungang He, Guoqing Zhang, Zhen Yang
Page qzae037