Volume: 22, Issue: 3

Mini Review

Computational Strategies and Algorithms for Inferring Cellular Composition of Spatial Transcriptomics Data

Xiuying Liu, Xianwen Ren

Spatial transcriptomics has become an essential and powerful method for delineating tissue architecture at the molecular level. However, because of the limitations of current spatial techniques, cellular information cannot be measured directly; instead, spatial spots typically ranging from 0.2 to 100 µm in diameter are characterized. It is therefore vital to apply computational strategies to infer the cellular composition within each spatial spot. The main objective of this review is to summarize the most recent progress in estimating the exact cellular proportions of each spatial spot and to outline future directions for this field.
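To make the task concrete, the sketch below estimates cell-type proportions for a single spot by non-negative least squares against reference expression signatures. It is a generic illustration, not a method taken from the review: the function name `estimate_proportions`, the multiplicative update rule, and all numbers are assumptions chosen for the example.

```python
def estimate_proportions(spot, signatures, n_iter=200):
    """Estimate non-negative cell-type proportions for one spatial spot.

    spot: expression value per gene (non-negative).
    signatures: one per-gene reference profile per cell type (non-negative).
    """
    k = len(signatures)
    weights = [1.0 / k] * k
    # Precompute S^T y and S^T S for the multiplicative least-squares updates.
    sty = [sum(s * y for s, y in zip(sig, spot)) for sig in signatures]
    sts = [[sum(a * b for a, b in zip(si, sj)) for sj in signatures]
           for si in signatures]
    for _ in range(n_iter):
        updated = []
        for i in range(k):
            denom = sum(sts[i][j] * weights[j] for j in range(k))
            # Multiplicative update keeps every weight non-negative.
            updated.append(weights[i] * sty[i] / denom if denom > 0 else 0.0)
        weights = updated
    total = sum(weights)
    # Rescale so the weights can be read as proportions.
    return [w / total for w in weights] if total > 0 else weights


# Hypothetical signatures for two cell types over four genes.
type_a = [10.0, 0.0, 5.0, 1.0]
type_b = [0.0, 8.0, 5.0, 2.0]
# A spot built as 70% type A and 30% type B.
spot = [7.0, 2.4, 5.0, 1.3]
proportions = estimate_proportions(spot, [type_a, type_b])
```

Real spots are noisy and sparse, so the approaches surveyed in such a review typically rely on probabilistic or learned models rather than this plain least-squares fit.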

Page qzae057


Original Research

Review and Evaluate the Bioinformatics Analysis Strategies of ATAC-seq and CUT&Tag Data

Siyuan Cheng, Benpeng Miao, Tiandao Li, Guoyan Zhao, Bo Zhang

Efficient and reliable profiling methods are essential to study epigenetics. Tn5, one of the first identified prokaryotic transposases with high DNA-binding and tagmentation efficiency, is widely adopted in different genomic and epigenomic protocols for high-throughput exploration of the genome and epigenome. Based on Tn5, the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) and the Cleavage Under Targets and Tagmentation (CUT&Tag) were developed to measure chromatin accessibility and detect DNA–protein interactions. These methodologies can be applied to large numbers of biological samples with low-input levels, such as rare tissues, embryos, and sorted single cells. However, fast and proper processing of these epigenomic data has become a bottleneck because massive data production continues to increase quickly. Furthermore, inappropriate data analysis can generate biased or misleading conclusions. Therefore, it is essential to evaluate the performance of Tn5-based ATAC-seq and CUT&Tag data processing bioinformatics tools, many of which were developed mostly for analyzing chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. Here, we conducted a comprehensive benchmarking analysis to evaluate the performance of eight popular software tools for processing ATAC-seq and CUT&Tag data. We compared the sensitivity, specificity, and peak width distribution for both narrow-type and broad-type peak calling. We also tested the influence of the availability of control IgG input in CUT&Tag data analysis. Finally, we evaluated the differential analysis strategies commonly used for analyzing the CUT&Tag data. Our study provided comprehensive guidance for selecting bioinformatics tools and recommended analysis strategies, which were implemented into Docker/Singularity images for streamlined data analysis.
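We performed a comprehensive benchmarking analysis to evaluate the performance of eight commonly used software tools for processing ATAC-seq and CUT&Tag data, tested the influence of using an IgG control in CUT&Tag data analysis, and evaluated differential analysis strategies for CUT&Tag data. Our study provides comprehensive guidance for selecting bioinformatics tools and analysis strategies for ATAC-seq and CUT&Tag data analysis, and establishes a standardized data analysis workflow.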

Page qzae054


Original Research

Evaluating Performance of Different RNA Secondary Structure Prediction Programs Using Self-cleaving Ribozymes

Fei Qi, Junjie Chen, Yue Chen, Jianfeng Sun, Yiting Lin, Zipeng Chen, Philipp Kapranov

Accurate identification of the correct, biologically relevant RNA structures is critical to understanding various aspects of RNA biology since proper folding represents the key to the functionality of all types of RNA molecules and plays pivotal roles in many essential biological processes. Thus, a plethora of approaches have been developed to predict, identify, or solve RNA structures based on various computational, molecular, genetic, chemical, or physicochemical strategies. Purely computational approaches hold distinct advantages over all other strategies in terms of the ease of implementation, time, speed, cost, and throughput, but they strongly underperform in accuracy, which significantly limits their broader application. Nonetheless, the advantages of these methods led to a steady development of multiple in silico RNA secondary structure prediction approaches including recent deep learning-based programs. Here, we compared the accuracy of predictions of biologically relevant secondary structures of dozens of self-cleaving ribozyme sequences using seven in silico RNA folding prediction tools with tasks of varying complexity. We found that while many programs performed well in relatively simple tasks, their performance varied significantly in more complex RNA folding problems. However, in general, a modern deep learning method outperformed the other programs in the complex tasks in predicting the RNA secondary structures, at least based on the specific class of sequences tested, suggesting that it may represent the future of RNA structure prediction algorithms.
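Research question: The structures that RNA molecules form by folding on themselves are essential to their biological functions, so identifying RNA structure is central to RNA research. Compared with experimental detection of RNA structure, computational prediction has clear advantages in convenience, speed, cost, and throughput. A wide variety of RNA structure prediction tools have been developed, but their performance across different RNA structure prediction tasks remains to be evaluated. Methods: Based on the known RNA structures of dozens of self-cleaving ribozymes, this study evaluated the performance of multiple software tools in RNA secondary structure prediction. We designed prediction tasks of varying complexity, including conventional RNA structure prediction, pseudoknot prediction, RNA structure prediction within sequence context, and prediction of structural perturbation caused by sequence changes. We compared several RNA structure prediction programs based on classical methods with two recent deep learning-based programs on these tasks. We also evaluated how much incorporating evolutionary information improves RNA structure prediction. Main results: 1. In simple RNA structure prediction tasks, the overall performance of the programs was very similar. 2. In more complex tasks, performance differed significantly among programs; deep learning-based methods, especially SPOT-RNA, showed a clear advantage in most complex tasks. 3. Incorporating evolutionary information helped RNA structure prediction, but the improvement was not significant.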
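One common way to score a predicted secondary structure against a known one is to compare their base pairs. The sketch below does this for dot-bracket strings and reports sensitivity, positive predictive value, and F1. The abstract does not state which metrics the study used, so these metrics, the function names, and the toy structures are assumptions; plain dot-bracket notation also cannot encode pseudoknots, which this sketch ignores.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.add((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def pair_scores(reference, predicted):
    """Sensitivity, positive predictive value, and F1 over base pairs."""
    ref, pred = base_pairs(reference), base_pairs(predicted)
    true_pos = len(ref & pred)
    sens = true_pos / len(ref) if ref else 0.0
    ppv = true_pos / len(pred) if pred else 0.0
    f1 = 2 * sens * ppv / (sens + ppv) if sens + ppv else 0.0
    return sens, ppv, f1


# Toy example: the prediction recovers one of the two reference pairs.
sens, ppv, f1 = pair_scores("((..))", "(....)")
```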

Page qzae043


Method

MethylGenotyper: Accurate Estimation of SNP Genotypes and Genetic Relatedness from DNA Methylation Data

Yi Jiang, Minghan Qu, Minghui Jiang, Xuan Jiang, Shane Fernandez, Tenielle Porter, Simon M. Laws, Colin L. Masters, Huan Guo, Shanshan Cheng, Chaolong Wang

Epigenome-wide association studies (EWAS) are susceptible to widespread confounding caused by population structure and genetic relatedness. Nevertheless, kinship estimation is challenging in EWAS without genotyping data. Here, we proposed MethylGenotyper, a method that for the first time enables accurate genotyping at thousands of single nucleotide polymorphisms (SNPs) directly from commercial DNA methylation microarrays. We modeled the intensities of methylation probes near SNPs with a mixture of three beta distributions corresponding to different genotypes and estimated parameters with an expectation-maximization algorithm. We conducted extensive simulations to demonstrate the performance of the method. When applying MethylGenotyper to the Infinium EPIC array data of 4662 Chinese samples, we obtained genotypes at 4319 SNPs with a concordance rate of 98.26%, enabling the identification of 255 pairs of close relatives. Furthermore, we showed that MethylGenotyper allows for the estimation of both population structure and cryptic relatedness among 702 Australians of diverse ancestry. We also implemented MethylGenotyper in a publicly available R package (https://github.com/Yi-Jiang/MethylGenotyper) to facilitate future large-scale EWAS.
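Research question: To develop a method for calling genotypes from DNA methylation data, enabling accurate estimation of confounders such as population genetic structure and relatedness in epigenome-wide association studies (EWAS). Methods: We developed MethylGenotyper, which genotypes SNPs located at the base-extension position of methylation probes. These SNPs can introduce mismatches with the probe sequence and thereby interfere with the measurement of methylation intensity. For each probe, the method recalculates the ratio of alternative allele intensity (RAI) of the target SNP from the distribution of methylation signal intensities, fits a mixture model to the RAI, and estimates parameters with an expectation-maximization algorithm to call genotypes. Main results: Applying MethylGenotyper to the Illumina EPIC v1.0 array data of 4662 Chinese individuals, we accurately called genotypes at 4319 SNPs and, based on these genotypes, identified 255 pairs of close relatives. In addition, MethylGenotyper can accurately infer population genetic structure and relatedness among 702 Australians. We have developed MethylGenotyper into a publicly available R package (https://github.com/Yi-Jiang/MethylGenotyper) to facilitate future large-scale EWAS. Datasets and algorithm links: DNA methylation data of the Dongfeng-Tongji (DFTJ) cohort (MethylGenotyper candidate probes): https://ngdc.cncb.ac.cn/omix (accession number "OMIX006294"). DNA methylation data of the AIBL cohort: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE153712. Genotype data of the AIBL cohort are available on application under the AIBL data use agreement (https://aibl.csiro.au/awd). MethylGenotyper R package: GitHub (https://github.com/Yi-Jiang/MethylGenotyper), BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT007466).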
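The mixture-model idea can be sketched in a few lines: intensity ratios near 0, 0.5, and 1 are treated as draws from three beta distributions, and expectation-maximization assigns each value to a genotype. This is not the MethylGenotyper implementation. The M-step below uses weighted method-of-moments estimates as a stand-in for an exact likelihood update, and the function names, starting values, and simulated data are all assumptions made for illustration.

```python
import math
import random


def beta_logpdf(x, a, b):
    """Log density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)


def fit_beta_mixture(values, n_iter=50):
    """EM for a three-component beta mixture; returns (weights, params)."""
    eps = 1e-6
    xs = [min(max(v, eps), 1 - eps) for v in values]   # keep logs finite
    params = [(2.0, 20.0), (20.0, 20.0), (20.0, 2.0)]  # start near 0, 0.5, 1
    weights = [1 / 3] * 3
    for _ in range(n_iter):
        # E-step: posterior probability of each genotype for each value.
        resp = []
        for x in xs:
            logs = [math.log(w) + beta_logpdf(x, a, b)
                    for w, (a, b) in zip(weights, params)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step (assumption): weighted moments instead of exact maximization.
        for k in range(3):
            n_k = sum(r[k] for r in resp)
            if n_k < 1e-8:
                continue  # leave an empty component unchanged
            mean = sum(r[k] * x for r, x in zip(resp, xs)) / n_k
            var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / n_k
            var = min(max(var, 1e-6), mean * (1 - mean) * 0.99)
            common = mean * (1 - mean) / var - 1
            params[k] = (mean * common, (1 - mean) * common)
            weights[k] = n_k / len(xs)
    return weights, params


def call_genotype(x, weights, params):
    """Most probable genotype (0, 1, or 2 alternative alleles) for one value."""
    x = min(max(x, 1e-6), 1 - 1e-6)
    scores = [math.log(w) + beta_logpdf(x, a, b)
              for w, (a, b) in zip(weights, params)]
    return scores.index(max(scores))


def simulate_rai(n_per_genotype=100, seed=0):
    """Toy ratio values for three genotype clusters (hypothetical shapes)."""
    rng = random.Random(seed)
    shapes = [(2, 30), (30, 30), (30, 2)]
    return [rng.betavariate(a, b) for a, b in shapes
            for _ in range(n_per_genotype)]
```

A typical use would be `weights, params = fit_beta_mixture(simulate_rai())` followed by `call_genotype(value, weights, params)` for each new value.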

Page qzae044


Method

BLUPmrMLM: A Fast mrMLM Algorithm in Genome-wide Association Studies

Hong-Fu Li, Jing-Tian Wang, Qiong Zhao, Yuan-Ming Zhang

Multilocus genome-wide association study has become the state-of-the-art tool for dissecting the genetic architecture of complex and multiomic traits. However, most existing multilocus methods require relatively long computational time when analyzing large datasets. To address this issue, in this study, we proposed a fast mrMLM method, namely, best linear unbiased prediction multilocus random-SNP-effect mixed linear model (BLUPmrMLM). First, genome-wide single-marker scanning in mrMLM was replaced by vectorized Wald tests based on the best linear unbiased prediction (BLUP) values of marker effects and their variances in BLUPmrMLM. Then, adaptive best subset selection (ABESS) was used to identify potentially associated markers on each chromosome to reduce computational time when estimating marker effects via empirical Bayes. Finally, shared memory and parallel computing schemes were used to reduce the computational time. In simulation studies, BLUPmrMLM outperformed GEMMA, EMMAX, mrMLM, and FarmCPU as well as the control method (BLUPmrMLM with ABESS removed), in terms of computational time, power, accuracy for estimating quantitative trait nucleotide positions and effects, false positive rate, false discovery rate, false negative rate, and F1 score. In the reanalysis of two large rice datasets, BLUPmrMLM significantly reduced the computational time and identified more previously reported genes, compared with the aforementioned methods. This study provides an excellent multilocus model method for the analysis of large-scale and multiomic datasets. The software mrMLM v5.1 is available at BioCode (https://ngdc.cncb.ac.cn/biocode/tool/BT007388) or GitHub (https://github.com/YuanmingZhang65/mrMLM).
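Research question: Multilocus association analysis with the mrMLM method takes a long time on large multiomic datasets; to address this, we developed BLUPmrMLM, a fast algorithm for mrMLM. Research design: 1. Using the BLUP values of marker effects and their variances, replace the genome-wide single-marker effect estimation and statistical testing in mrMLM with vectorized Wald tests; 2. Use adaptive best subset selection (ABESS) to screen potentially associated markers on each chromosome; 3. Use shared-memory and parallel computing programming techniques; 4. Assess the new method with simulated and real data. Main result 1: BLUPmrMLM took 0.92 and 0.26 hours, respectively, to analyze a dataset of 1439 rice varieties with 1.10 million markers and 5 traits and a dataset of 2261 rice varieties with 1.01 million markers and 2 traits, which is comparable to EMMAX. Main result 2: In simulation studies, the QTN detection power of BLUPmrMLM was about 10% higher than that of mrMLM and much higher than that of GEMMA, EMMAX, and FarmCPU, with the same trend in the accuracy of QTN position and effect estimation and in F1 score; in association analyses of the two large rice datasets, it identified more previously reported known genes. Algorithm links: https://ngdc.cncb.ac.cn/biocode/tools/BT007388 (BioCode) https://github.com/YuanmingZhang65/mrMLM (GitHub)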
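The Wald test mentioned in the first step can be illustrated as follows: each marker's estimated effect is squared and divided by its variance, and the result is compared with a chi-square distribution with one degree of freedom. This is a minimal sketch, not the mrMLM v5.1 code. The function name and input values are hypothetical, and a production implementation would apply the same formula to whole arrays at once rather than looping.

```python
import math


def wald_tests(effects, variances):
    """Return (statistic, p-value) per marker for the null of zero effect."""
    results = []
    for b, v in zip(effects, variances):
        w = b * b / v                      # Wald statistic, 1 degree of freedom
        p = math.erfc(math.sqrt(w / 2.0))  # chi-square(1) upper-tail probability
        results.append((w, p))
    return results


# Hypothetical effect estimates and variances for two markers.
results = wald_tests([1.96, 0.0], [1.0, 2.0])
```

The p-value uses the identity that a chi-square variable with one degree of freedom is a squared standard normal, so its upper tail equals `erfc(sqrt(w / 2))`.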

Page qzae020


Method

Inter3D: Capture of TAD Reorganization Endows Variant Patterns of Gene Transcription

Tianyi Ding, Shaliu Fu, Xiaoyu Zhang, Fan Yang, Jixing Zhang, Haowen Xu, Jiaqi Yang, Chaoqun Chen, Yibing Shi, Yiran Bai, Wannian Li, Xindi Chang, Shanjin Wang, Chao Zhang, Qi Liu, He Zhang

Topologically associating domain (TAD) reorganization commonly occurs in the cell nucleus and contributes to gene activation and inhibition through the separation or fusion of adjacent TADs. However, functional genes impacted by TAD alteration and the underlying mechanism of TAD reorganization regulating gene transcription remain to be fully elucidated. Here, we first developed a novel approach termed Inter3D to specifically identify genes regulated by TAD reorganization. Our study revealed that the segregation of TADs led to the disruption of intrachromosomal looping at the myosin light chain 12B (MYL12B) locus, via the meticulous reorganization of TADs mediating epigenomic landscapes within tumor cells, thereby exhibiting a significant correlation with the down-regulation of its transcriptional activity. Conversely, the fusion of TADs facilitated intrachromosomal interactions, suggesting a potential association with the activation of cytochrome P450 family 27 subfamily B member 1 (CYP27B1). Our study provides comprehensive insight into the capture of TAD rearrangement-mediated gene loci and moves toward understanding the functional role of TAD reorganization in gene transcription. The Inter3D pipeline developed in this study is freely available at https://github.com/bm2-lab/inter3D and https://ngdc.cncb.ac.cn/biocode/tool/BT7399.
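Research question: Topologically associating domain (TAD) reorganization, which occurs widely in the cell nucleus, can affect gene activation and repression by separating or fusing adjacent TADs. However, how to identify the functional genes affected by TAD alteration, and the role of TAD reorganization in regulating gene transcription, remain to be fully elucidated. Methods: By jointly analyzing high-throughput multi-omics data including Hi-C, CTCF ChIP-seq, ATAC-seq, and RNA-seq, we designed and developed a new method named Inter3D that specifically identifies functional chromatin loops mediated by TAD reorganization and the gene loci they regulate. Main finding 1: By integrating three-dimensional multi-omics data, we created the Inter3D method, which comprehensively captures functional chromatin loops mediated by TAD reorganization and the gene loci they regulate. Main finding 2: We constructed a genome-wide chromatin interaction network of retinoblastoma, characterized its distinctive higher-order chromatin structure, and precisely identified the coding genes regulated by inter-TAD and intra-TAD chromatin interactions. Main finding 3: Using experimental methods such as 3C, we showed that TAD separation at the MYL12B locus blocks the formation of a chromatin loop between its promoter and enhancer, significantly down-regulating MYL12B expression. We also found that TAD fusion promotes the formation of a CYP27B1-specific chromatin loop, activating CYP27B1 expression. Datasets and algorithm links: GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE230798) NGDC database (https://ngdc.cncb.ac.cn/gsa-human/browse/HRA007411) GitHub (https://github.com/bm2-lab/inter3D) Zenodo (https://zenodo.org/record/7857813#.ZEYooexByjB) Figshare (https://figshare.com/projects/inter3D_ucsc_genome_tracks/165706) BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT007399/releases/v1.0.0)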

Page qzae034


Method

MSIsensor-RNA: Microsatellite Instability Detection for Bulk and Single-cell Gene Expression Data

Peng Jia, Xuanhao Yang, Xiaofei Yang, Tingjie Wang, Yu Xu, Kai Ye

Microsatellite instability (MSI) is an indispensable biomarker in cancer immunotherapy. Currently, MSI scoring methods based on high-throughput omics data have gained popularity and demonstrated better performance than the gold standard method for MSI detection. However, MSI detection methods for expression data, especially single-cell expression data, are still lacking, limiting the scope of clinical application and prohibiting the investigation of MSI at a single-cell level. Herein, we developed MSIsensor-RNA, an accurate, robust, adaptable, and standalone software to detect MSI status based on expression values of MSI-associated genes. We demonstrated the favorable performance and promise of MSIsensor-RNA in both bulk and single-cell gene expression data in multiplatform technologies including RNA sequencing (RNA-seq), microarray, and single-cell RNA-seq. MSIsensor-RNA is a versatile, efficient, and robust method for MSI status detection from both bulk and single-cell gene expression data in clinical studies and applications. MSIsensor-RNA is available at https://github.com/xjtu-omics/msisensor-rna.
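Background: Microsatellite instability (MSI) is a molecular phenotype in which microsatellite regions of the genome become hypermutated because the DNA mismatch repair system is impaired in malignant tumors. MSI is closely associated with tumor initiation, progression, and prognosis, and is an important molecular marker for patient selection and response prediction in immunotherapy. The two gold-standard MSI detection methods currently used in clinical practice are MSI-PCR and MSI-IHC, but both are time-consuming, labor-intensive, and costly. In recent years, researchers have developed a series of methods for genomic sequencing data, but most target DNA sequencing data and cannot be applied to transcriptome data, especially single-cell transcriptome data, which to some extent limits the broad application of MSI. Research design: 1. Propose MSIsensor-RNA, an MSI detection method based on gene expression data; 2. Establish a method for selecting signature gene sets for different cancer types; 3. Design a machine learning algorithm to predict MSI samples from expression data of different platforms. Main results: After extensive testing and validation, MSIsensor-RNA showed performance highly consistent with traditional MSI detection methods. The method is applicable not only to microarray and RNA-seq samples but also to single-cell transcriptome data, opening new possibilities for the broad application of MSI.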

Page qzae004


Method

SuperFeat: Quantitative Feature Learning from Single-cell RNA-seq Data Facilitates Drug Repurposing

Jianmei Zhong, Junyao Yang, Yinghui Song, Zhihua Zhang, Chunming Wang, Renyang Tong, Chenglong Li, Nanhui Yu, Lianhong Zou, Sulai Liu, Jun Pu, Wei Lin

In this study, we devised a computational framework called Supervised Feature Learning and Scoring (SuperFeat) which enables the training of a machine learning model and evaluates the canonical cellular statuses/features in pathological tissues that underlie the progression of disease. This framework also enables the identification of potential drugs that target the presumed detrimental cellular features. This framework was constructed on the basis of an artificial neural network with the gene expression profiles serving as input nodes. The training data comprised single-cell RNA sequencing datasets that encompassed the specific cell lineage during the developmental progression of cell features. A few models of the canonical cancer-involved cellular statuses/features were tested with this framework. Finally, we illustrated the drug repurposing pipeline, utilizing the training parameters derived from the adverse cellular statuses/features, which yielded successful validation results both in vitro and in vivo. SuperFeat is accessible at https://github.com/weilin-genomics/rSuperFeat.

Page qzae036


Database

GenBase: A Nucleotide Sequence Database

Congfan Bu, Xinchang Zheng, Xuetong Zhao, Tianyi Xu, Xue Bai, Yaokai Jia, Meili Chen, Lili Hao, Jingfa Xiao, Zhang Zhang, Wenming Zhao, Bixia Tang, Yiming Bao

The rapid advancement of sequencing technologies poses challenges in managing the large volume and exponential growth of sequence data efficiently and on time. To address this issue, we present GenBase (https://ngdc.cncb.ac.cn/genbase), an open-access data repository that follows the International Nucleotide Sequence Database Collaboration (INSDC) data standards and structures, for efficient nucleotide sequence archiving, searching, and sharing. As a core resource within the National Genomics Data Center (NGDC) of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GenBase offers a bilingual submission pipeline and services, as well as local submission assistance in China. GenBase also provides a unique Excel format for metadata description and feature annotation of nucleotide sequences, along with a real-time data validation system to streamline sequence submissions. As of April 23, 2024, GenBase had received 68,251 nucleotide sequences and 689,574 annotated protein sequences across 414 species from 2319 submissions. Of these, 63,614 (93%) nucleotide sequences and 620,640 (90%) annotated protein sequences have been released and are publicly accessible through GenBase’s web search system, File Transfer Protocol (FTP), and Application Programming Interface (API). Additionally, in collaboration with INSDC, GenBase has constructed an effective data exchange mechanism with GenBank and started sharing released nucleotide sequences. Furthermore, GenBase integrates all sequences from GenBank with daily updates, demonstrating its commitment to actively contributing to global sequence data management and sharing.
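Gene sequences and annotation information (including DNA, RNA, and protein sequence information) are among the core foundational data supporting research on gene function. With the rapid development of biology, scientists in China's life science community have produced massive amounts of gene sequence data over the past decades, much of which has been submitted to the International Nucleotide Sequence Database Collaboration (INSDC). At present, researchers in China and other countries/regions rely heavily on INSDC for sequence submission and retrieval. Meanwhile, the rapid development of sequencing technologies has led to rapid growth in the volume of sequence data, which poses great challenges for timely and efficient submission and sharing. To safeguard the sovereignty and security of China's gene sequence data and to meet the practical needs of Chinese researchers in depositing, managing, and sharing gene sequence data, we developed the gene sequence database GenBase (https://ngdc.cncb.ac.cn/genbase/), benchmarked against the GenBank database of the US National Center for Biotechnology Information (NCBI). GenBase is a core resource of the National Genomics Data Center. It adopts the GenBank data model and, through an online bilingual submission system, supports submission of multiple data types, including genomic DNA, mRNA, ncRNA, and nucleic acid sequences from organelles, viruses, plasmids, and phages. In addition, GenBase integrates all sequences from GenBank with daily updates, provides free and publicly accessible data, supports the distribution and sharing of international datasets, and facilitates data access for Chinese researchers.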

Page qzae047


Database

SMARTdb: An Integrated Database for Exploring Single-cell Multi-omics Data of Reproductive Medicine

Zekai Liu, Zhen Yuan, Yunlei Guo, Ruilin Wang, Yusheng Guan, Zhanglian Wang, Yunan Chen, Tianlu Wang, Meining Jiang, Shuhui Bian

Single-cell multi-omics sequencing has greatly accelerated reproductive research in recent years, and the data are continually growing. However, utilizing these data resources is challenging for wet-lab researchers. A comprehensive platform for exploring single-cell multi-omics data related to reproduction is urgently needed. Here, we introduce the single-cell multi-omics atlas of reproduction (SMARTdb), an integrative and user-friendly platform for exploring molecular dynamics of reproductive development, aging, and disease, which covers multi-omics, multi-species, and multi-stage data. We curated and analyzed single-cell transcriptomic and epigenomic data of over 2.0 million cells from 6 species across the entire lifespan. A series of powerful functionalities are provided, such as “Query gene expression”, “DIY expression plot”, “DNA methylation plot”, and “Epigenome browser”. With SMARTdb, we found that the male germ cell-specific expression pattern of RPL39L and RPL10L is conserved between human and other model animals. Moreover, DNA hypomethylation and open chromatin may collectively regulate the specific expression pattern of RPL39L in both male and female germ cells. In summary, SMARTdb is a powerful platform for convenient data mining and gaining novel insights into reproductive development, aging, and disease. SMARTdb is publicly available at https://smart-db.cn.
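In recent years, single-cell multi-omics technologies have greatly advanced research in reproductive medicine and have also generated massive, valuable multi-omics data resources. However, because multi-omics data analysis is highly challenging, researchers without bioinformatics experience have been hindered from further using and mining these data. A comprehensive, interactive online platform is therefore urgently needed to bridge users and these data resources. To this end, we built SMARTdb (https://smart-db.cn), a comprehensive platform for exploring single-cell multi-omics data of reproductive medicine that covers the entire lifespan and multiple species. We curated and analyzed single-cell multi-omics data related to reproductive medicine from recent years, with three main features: (1) Entire lifespan. The data cover the major life stages, including early embryo, fetus, infant, puberty, adulthood, and aging, with more than 120 specific time points. SMARTdb also includes many datasets on human male infertility, providing valuable resources for studying reproductive development, aging, and disease. (2) Multiple species. Six species are currently included: human, monkey, mouse, pig, buffalo, and goat. (3) Single-cell multi-omics. Single-cell transcriptome, single-cell DNA methylation, and single-cell chromatin accessibility data from more than 2 million cells are currently included. With SMARTdb, users can explore massive single-cell multi-omics data related to reproductive medicine with a few mouse clicks. SMARTdb builds a bridge to the single-cell world for researchers without bioinformatics experience, helping them use these data resources more effectively and conveniently and accelerating their own discoveries.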

Page qzae005


Database

SCancerRNA: Expression at the Single-cell Level and Interaction Resource of Non-coding RNA Biomarkers for Cancers

Hongzhe Guo, Liyuan Zhang, Xinran Cui, Liang Cheng, Tianyi Zhao, Yadong Wang

Non-coding RNAs (ncRNAs) participate in multiple biological processes associated with cancers as tumor suppressors or oncogenic drivers. Due to their high stability in plasma, urine, and many other fluids, ncRNAs have the potential to serve as key biomarkers for early diagnosis and screening of cancers. During cancer progression, tumor heterogeneity plays a crucial role, and it is particularly important to understand the gene expression patterns of individual cells. With the development of single-cell RNA sequencing (scRNA-seq) technologies, uncovering gene expression in different cell types for human cancers has become feasible by profiling transcriptomes at the cellular level. However, a well-organized and comprehensive online resource that provides access to the expression of genes corresponding to ncRNA biomarkers in different cell types at the single-cell level is not available yet. Therefore, we developed the SCancerRNA database to summarize experimentally supported data on long ncRNA, microRNA, PIWI-interacting RNA, small nucleolar RNA, and circular RNA biomarkers, as well as data on their differential expression at the cellular level. Furthermore, we collected biological functions and clinical applications of biomarkers to facilitate the application of ncRNA biomarkers to cancer diagnosis, as well as the monitoring of progression and targeted therapies. SCancerRNA also allows users to explore interaction networks of different types of ncRNAs, and build computational models in the future. SCancerRNA is freely accessible at http://www.scancerrna.com/BioMarker.

Page qzae023


Database

AVM: A Manually Curated Database of Aerosol-transmitted Virus Mutations, Human Diseases, and Drugs

Lan Mei, Yaopan Hou, Jiajun Zhou, Yetong Chang, Yuwei Liu, Di Wang, Yunpeng Zhang, Shangwei Ning, Xia Li

Aerosol-transmitted viruses possess strong infectivity and can spread over long distances, which has earned them a reputation as difficult to control. They cause various human diseases and pose serious threats to human health. Mutations can increase the transmissibility and virulence of the strains, reducing the protection provided by vaccines and weakening the efficacy of antiviral drugs. In this study, we established a manually curated database (termed AVM) to store information on aerosol-transmitted viral mutations (VMs). The current version of the AVM contains 42,041 VMs (including 2613 immune escape mutations), 45 clinical information datasets, and 407 drugs/antibodies/vaccines. Additionally, we recorded 88 human diseases associated with viruses and found that the same virus can target multiple organs in the body, leading to diverse diseases. Furthermore, the AVM database offers a straightforward user interface for browsing, retrieving, and downloading information. This database is a comprehensive resource that can provide timely and valuable information on the transmission, treatment, and diseases caused by aerosol-transmitted viruses (http://www.bio-bigdata.center/AVM).
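Research question: A systematic compilation of mutations in aerosol-transmitted viruses. Methods: The Jones and Brosseau criteria were used to determine which viruses are transmitted by aerosol. Based on extensive collection and mining of viral mutation (VM) information in public databases, we developed AVM, an experimentally validated database of human aerosol-transmitted viruses. Main results: 1. AVM contains information on the effects of VMs on protein function, drugs, diseases, clinical data, and immune escape, and provides a reference for the prevention and treatment of aerosol-transmitted viruses. 2. AVM provides experimentally confirmed mutations related to viral transmission or pathogenic mechanisms, which supports research on viral transmission and pathogenicity, further exploration of viral regulatory mechanisms, and the development of specific antiviral drugs. 3. AVM provides data on immune escape and drug resistance sites, as well as information on mutational resistance to antibodies, vaccines, or drugs, which can help professionals choose more scientific and rational treatment strategies.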

Page qzae041


Database

Nphos: Database and Predictor of Protein N-phosphorylation

Ming-Xiao Zhao, Ruo-Fan Ding, Qiang Chen, Junhua Meng, Fulai Li, Songsen Fu, Biling Huang, Yan Liu, Zhi-Liang Ji, Yufen Zhao

Protein N-phosphorylation is widely present in nature and participates in various biological processes. However, current knowledge on N-phosphorylation is extremely limited compared to that on O-phosphorylation. In this study, we collected 11,710 experimentally verified N-phosphosites of 7344 proteins from 39 species and subsequently constructed the database Nphos to share up-to-date information on protein N-phosphorylation. Based on these substantial data, we characterized the sequential and structural features of protein N-phosphorylation. Moreover, after comparing hundreds of learning models, we chose and optimized gradient boosting decision tree (GBDT) models to predict three types of human N-phosphorylation, achieving mean area under the receiver operating characteristic curve (AUC) values of 90.56%, 91.24%, and 92.01% for pHis, pLys, and pArg, respectively. Meanwhile, we discovered 488,825 distinct N-phosphosites in the human proteome. The models were also deployed in Nphos for interactive N-phosphosite prediction. In summary, this work provides new insights and points for both flexible and focused investigations of N-phosphorylation. It will also facilitate a deeper and more systematic understanding of protein N-phosphorylation modification by providing a data and technical foundation. Nphos is freely available at http://www.bio-add.org/Nphos/ and http://ppodd.org.cn/Nphos/.
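Research question: Protein N-phosphorylation is a type of protein post-translational modification (PTM) in which histidine (His), lysine (Lys), and arginine (Arg) residues are phosphorylated. It is widespread in prokaryotes and eukaryotes and plays important biological roles. However, because of the instability of the phosphoramidate bond and the lack of research techniques, progress in studying protein N-phosphorylation has been slow. In addition, the lack of systematic collection and centralized analysis of protein N-phosphorylation data, and of prediction tools, poses many challenges for data reuse. Building a protein N-phosphorylation database and site prediction platform can therefore provide a new platform and new opportunities for focused research on protein N-phosphorylation. Innovations: 1. Mined a large amount of N-phosphorylation site information and revealed the sequence and structural features of N-phosphorylation; 2. Built prediction models for three types of human N-phosphorylation sites, compared them with known models, and predicted potential N-phosphorylation sites across the human proteome; 3. Built Nphos, an N-phosphorylation database and online prediction platform, and drew the most comprehensive map of N-phosphorylation to date. Research design: In this study, we collected 11,710 experimentally verified N-phosphorylation sites of 7344 proteins from 39 species through literature searches and database searches of raw mass spectrometry data, and assembled these data into the protein N-phosphorylation database Nphos (http://www.bio-add.org/Nphos/ and http://ppodd.org.cn/Nphos/). Based on the Nphos database, we analyzed the sequence and structural features of protein N-phosphorylation; we then compared hundreds of machine learning models and selected and optimized gradient boosting decision tree (GBDT) models as predictors of pHis, pLys, and pArg sites, each with a prediction accuracy above 90%. Using the final trained predictors, we found 488,825 potential protein N-phosphorylation sites in the human proteome. The three predictors have also been deployed on the Nphos website.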
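For readers unfamiliar with the AUC metric reported above, the sketch below computes it directly from its definition: the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half. The function name, labels, and scores are made up for illustration and have no connection to the Nphos models or data.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = phosphosite) and predictor scores.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

This pairwise form is quadratic in the number of sites, which is fine for a small example; rank-based formulas give the same value more efficiently on large datasets.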

Page qzae032


Web Server

CBioProfiler: A Web and Standalone Pipeline for Cancer Biomarker and Subtype Characterization

Xiaoping Liu, Zisong Wang, Hongjie Shi, Sheng Li, Xinghuan Wang

Cancer is a leading cause of death worldwide, and the identification of biomarkers and subtypes that can predict the long-term survival of cancer patients is essential for their risk stratification, treatment, and prognosis. However, there are currently no standardized tools for exploring cancer biomarkers or subtypes. In this study, we introduced Cancer Biomarker and subtype Profiler (CBioProfiler), a web server and standalone application that includes two pipelines for analyzing cancer biomarkers and subtypes. The cancer biomarker pipeline consists of five modules for identifying and annotating cancer survival-related biomarkers using multiple survival-related machine learning algorithms. The cancer subtype pipeline includes three modules for data preprocessing, subtype identification using multiple unsupervised machine learning methods, and subtype evaluation and validation. CBioProfiler also includes CuratedCancerPrognosisData, a novel R package that integrates reviewed and curated gene expression and clinical data from 268 studies. These studies cover 43 common blood and solid tumors and draw upon 47,686 clinical samples. The web server is available at https://www.cbioprofiler.com/ and https://cbioprofiler.znhospital.cn/CBioProfiler/, and the standalone app and source code can be found at https://github.com/liuxiaoping2020/CBioProfiler.

Page qzae045