Article Online - Genomics, Proteomics & Bioinformatics

Volume: 21, Issue: 5

Editorial

Toward Inclusiveness and Thoroughness: A Paradigm Shift from More-ever-omics to Holovivology

Jun Yu

Page 895-896

Historical Note

A Historic Retrospective on the Early Bioinformatics Research in China

Runsheng Chen

Page 897-899

Perspective

From BIG Data Center to China National Center for Bioinformation

Yiming Bao, Yongbiao Xue

Page 900-903

Perspective

Toward A New Paradigm of Genomics Research: Celebration of the 20th Anniversary of Beijing Institute of Genomics

Zhang Zhang, Songnian Hu, Jun Yu

Twenty years after the completion and forty years after the proposal of the Human Genome Project (HGP), genomics, together with its twin field, bioinformatics, has entered a new paradigm in which its bioscience-related, discipline-centric applications are creating many new research frontiers. Beijing Institute of Genomics (BIG), now also known as China National Center for Bioinformation (CNCB), will play key roles in supporting and participating in these frontier research activities. On the 20th anniversary of the establishment of BIG, we provide a brief retrospective of its historic events and identify strategic research directions with a broader vision for future genomics, in which digital genome, digital medicine, and digital health are structured to meet the needs of human life and healthcare, as well as their related metaverses.

Page 904-909

Preview

Revolutionizing Antibody Discovery: An Innovative AI Model for Generating Robust Libraries

Yaojun Wang, Shiwei Sun

Page 910-912

Review Article

Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms

Bin Huang, Lupeng Kong, Chao Wang, Fusong Ju, Qi Zhang, Jianwei Zhu, Tiansu Gong, Haicang Zhang, Chungong Yu, Wei-Mou Zheng, Dongbo Bu

Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem — finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.

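Key points: Protein structure prediction is an interdisciplinary research topic that has attracted researchers from biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt different research paradigms to attack the same structure prediction problem: biochemists and physicists try to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start by assuming a probability distribution of protein structures given a target sequence and then find the most likely structure; computer scientists formulate protein structure prediction as an optimization problem, finding the structural conformation with the lowest energy or minimizing the difference between the predicted and native structures. These paradigms fall into the two statistical modeling cultures proposed by L. Breiman, namely data modeling and algorithmic modeling. Recently, deep learning has also achieved great success in protein structure prediction.

Methods: In this article, we survey research on protein structure prediction. We compare the paradigms adopted by researchers from different fields and highlight the paradigm shift in the era of deep learning. Overall, algorithmic modeling techniques, especially deep neural networks, have markedly improved the accuracy of protein structure prediction; however, theories that interpret the neural networks and explain protein folding are still much needed.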
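
The formulations contrasted in this abstract can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the review: $x$ denotes the target sequence, $s$ a candidate structure, $s^{\ast}$ the native structure, $E$ an energy function, $f_{\theta}$ a predictor with parameters $\theta$, and $d$ some structural distance.

```latex
% Statistical view: choose the most probable structure given the sequence
\hat{s}_{\mathrm{stat}} = \arg\max_{s} \; P(s \mid x)

% Optimization view (a): choose the lowest-energy conformation
\hat{s}_{\mathrm{opt}} = \arg\min_{s} \; E(s; x)

% Optimization view (b): fit a predictor by minimizing the gap to the native structure
\hat{\theta} = \arg\min_{\theta} \; d\bigl(f_{\theta}(x),\, s^{\ast}\bigr)
```

If one additionally assumes a Boltzmann-type relation $P(s \mid x) \propto \exp\bigl(-E(s; x)/k_{B}T\bigr)$ with $T > 0$, the first two problems select the same structure, since maximizing the probability is then equivalent to minimizing the energy.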

Page 913-925

Review Article

Decoding Human Biology and Disease Using Single-cell Omics Technologies

Qiang Shi, Xueyan Chen, Zemin Zhang

Over the past decade, advances in single-cell omics (SCO) technologies have enabled the investigation of cellular heterogeneity at an unprecedented resolution and scale, opening a new avenue for understanding human biology and disease. In this review, we summarize the development of sequencing-based SCO technologies and computational methods, and focus on the insights that SCO sequencing studies have provided into normal and diseased states, with a particular emphasis on cancer research. We also discuss technological improvements in SCO and its possible contribution to fundamental research on human biology, as well as its great potential in clinical diagnosis and personalized therapy of human disease.

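Since Professor Fuchou Tang first developed single-cell transcriptome sequencing in 2009, a variety of single-cell omics (SCO) sequencing methods have been widely applied to reveal cellular features at the genomic, epigenomic, transcriptomic, and proteomic levels. The rapid development of SCO technologies and related computational methods has played an extremely important role in advancing research in cancer, development, immunology, regenerative medicine, and plant science. As a result, SCO sequencing technologies have twice been selected as Method of the Year by Nature Methods. This review covers four aspects of SCO technologies.

1. Technology development: summarizes and compares the features, similarities, and differences of single-cell genomic, transcriptomic, epigenomic, and multi-omics methods.

2. Computational methods: introduces the basic logic, workflows, and representative algorithms for analyzing single-cell transcriptomic and multi-omics data.

3. Applications: discusses representative findings from SCO technologies on cellular heterogeneity in physiological and disease states, the tumor microenvironment, new targets for immunotherapy, COVID-19 research, and large-scale data resources.

4. Future prospects: discusses the technical development of SCO itself and its prospects in basic research and clinical application, emphasizing its great potential in disease diagnosis and personalized therapy.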

Page 926-949

Review Article

Omics Views of Mechanisms for Cell Fate Determination in Early Mammalian Development

Lin-Fang Ju, Heng-Ji Xu, Yun-Gui Yang, Ying Yang

During mammalian preimplantation development, a totipotent zygote undergoes several cell cleavages and two rounds of cell fate determination, ultimately forming a mature blastocyst. Along with compaction, the establishment of apicobasal cell polarity breaks the symmetry of the embryo and guides subsequent cell fate choice. Although the lineage segregation of the inner cell mass (ICM) and trophectoderm (TE) is the first sign of cell differentiation, several molecules have been shown to bias early cell fate through their inter-cellular variations at much earlier stages, including the 2- and 4-cell stages. The underlying mechanisms of early cell fate determination have long been an important research topic. In this review, we summarize the molecular events that occur during early embryogenesis, as well as the current understanding of their regulatory roles in cell fate decisions. Moreover, as powerful tools for early embryogenesis research, single-cell omics techniques have been applied to both mouse and human preimplantation embryos and have contributed to the discovery of cell fate regulators. Here, we summarize their applications in the research of preimplantation embryos, and provide new insights and perspectives on cell fate regulation.

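During mammalian preimplantation development, a totipotent zygote undergoes multiple cell divisions and two rounds of cell fate determination, ultimately forming a mature blastocyst composed of the inner cell mass (ICM) and trophectoderm (TE). Although the segregation of the TE and ICM lineages marks the first cell differentiation, growing evidence indicates that inter-cellular heterogeneity arising at earlier stages of embryonic development (including the 2-cell and 4-cell stages) also participates in regulating early embryonic cell fate. Precise cell fate determination sets the distinct developmental trajectories of cells by promoting various totipotent cell states; tracing the earliest cell fate decisions during embryonic development has therefore long been a hot topic in the field, and the mechanisms underlying early cell fate determination are an important research direction. In this review, we summarize the molecular events that occur during early embryonic fate determination and their regulatory roles in cell fate decisions, including cell lineage differentiation at the morula-to-blastocyst stage, establishment of apical-basal polarity at the 8- to 16-cell stage, and the mechanisms by which cellular heterogeneity at the 2-cell and 4-cell stages regulates cell fate determination. In addition, single-cell omics techniques have served as powerful tools for studying early embryonic regulation; they have been used to dissect the programmed molecular events of mouse and human preimplantation embryos and have yielded many findings on cell fate determination. We comprehensively summarize their applications in preimplantation embryo research and, drawing on existing omics data, offer new insights and future perspectives.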

Page 950-961

Review Article

Patient Assessment and Therapy Planning Based on Homologous Recombination Repair Deficiency

Wenbin Li, Lin Gao, Xin Yi, Shuangfeng Shi, Jie Huang, Leming Shi, Xiaoyan Zhou, Lingying Wu, Jianming Ying

Defects in genes involved in the DNA damage response cause homologous recombination repair deficiency (HRD). HRD is found in a subgroup of cancer patients for several tumor types, and it has a clinical relevance to cancer prevention and therapies. Accumulating evidence has identified HRD as a biomarker for assessing the therapeutic response of tumor cells to poly(ADP-ribose) polymerase inhibitors and platinum-based chemotherapies. Nevertheless, the biology of HRD is complex, and its applications and the benefits of different HRD biomarker assays are controversial. This is primarily due to inconsistencies in HRD assessments and definitions (gene-level tests, genomic scars, mutational signatures, or a combination of these methods) and difficulties in assessing the contribution of each genomic event. Therefore, we aim to review the biological rationale and clinical evidence of HRD as a biomarker. This review provides a blueprint for the standardization and harmonization of HRD assessments.

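Research question: Defects in genes involved in the DNA damage response cause homologous recombination repair deficiency (HRD). HRD is found in a subgroup of cancer patients across several tumor types and has clinical relevance for cancer prevention and therapy. Accumulating evidence indicates that HRD is a biomarker for assessing the therapeutic response of tumor cells to poly(ADP-ribose) polymerase inhibitors and platinum-based chemotherapies. However, the biology of HRD is complex, and its applications and the relative merits of different HRD biomarker assays are controversial. This is mainly due to inconsistencies in HRD assessments and definitions (gene-level tests, genomic scars, mutational signatures, or a combination of these methods) and the difficulty of assessing the contribution of each genomic event. We therefore aim to review the biological rationale and clinical evidence for HRD as a biomarker. This review provides a blueprint for the standardization and harmonization of HRD assessments.

Methods: By reviewing and organizing the existing relevant literature, we provide a theoretical basis for the standardization of HRD assessment.

Main results: This study reviews the definition of HRD, methods for HRD assessment, clinical applications of HRD testing, the limitations, optimization, and standardization of HRD testing, and the value of HRD testing as a diagnostic and prognostic biomarker in cancer, establishing a theoretical foundation for HRD standardization.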

Page 962-975

Original Research

HPC-Atlas: Computationally Constructing A Comprehensive Atlas of Human Protein Complexes

Yuliang Pan, Ruiyi Li, Wengen Li, Liuzhenghao Lv, Jihong Guan, Shuigeng Zhou

A fundamental principle of biology is that proteins tend to form complexes to play important roles in the core functions of cells. For a complete understanding of human cellular functions, it is crucial to have a comprehensive atlas of human protein complexes. Unfortunately, we still lack such a comprehensive atlas of experimentally validated protein complexes, which prevents us from gaining a complete understanding of the compositions and functions of human protein complexes, as well as the underlying biological mechanisms. To fill this gap, we built the Human Protein Complexes Atlas (HPC-Atlas), to our knowledge the most accurate and comprehensive atlas of human protein complexes available to date. We integrated two of the latest protein interaction networks and developed a novel computational method to identify nearly 9000 protein complexes, including many previously uncharacterized complexes. Compared with existing methods, our method achieved outstanding performance on both testing and independent datasets. Furthermore, with HPC-Atlas we identified 751 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-affected human protein complexes, and 456 multifunctional proteins that contain many potential moonlighting proteins. These results suggest that HPC-Atlas can serve not only as a computing framework to effectively identify biologically meaningful protein complexes by integrating multiple protein data sources, but also as a valuable resource for exploring new biological findings. The HPC-Atlas webserver is freely available at http://www.yulpan.top/HPC-Atlas.

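Protein complexes are composed of groups of interacting proteins; they carry out many important and complex biological functions in cells and participate in nearly all life processes in organisms. However, the number of known human protein complexes is small, which hinders our understanding of their composition, functions, and related biological mechanisms. We therefore used a computational approach to construct an atlas of human protein complexes, the Human Protein Complexes Atlas (HPC-Atlas). By integrating two of the latest human protein interaction networks and developing a new computational method, we predicted nearly 9000 human protein complexes on the integrated network, including many potential genuine protein complexes. Compared with existing work, our method achieved outstanding predictive performance on both the test set and the independent test set. In addition, based on the constructed atlas, we identified 751 human protein complexes associated with SARS-CoV-2 and 456 multifunctional proteins. These results show that HPC-Atlas can serve not only as a general computational framework for predicting biologically meaningful human protein complexes, but also as a valuable resource for mining and analyzing related biological information. HPC-Atlas is available online at http://www.yulpan.top/HPC-Atlas.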

Page 976-990

Original Research

Non-small Cell Lung Cancer Epigenomes Exhibit Altered DNA Methylation in Smokers and Never-smokers

Jennifer A. Karlow, Erica C. Pehrsson, Xiaoyun Xing, Mark Watson, Siddhartha Devarakonda, Ramaswamy Govindan, Ting Wang

Epigenetic alterations are widespread in cancer and can complement genetic alterations to influence cancer progression and treatment outcome. To determine the potential contribution of DNA methylation alterations to tumor phenotype in non-small cell lung cancer (NSCLC) in both smoker and never-smoker patients, we performed genome-wide profiling of DNA methylation in 17 primary NSCLC tumors and 10 matched normal lung samples using the complementary assays, methylated DNA immunoprecipitation sequencing (MeDIP-seq) and methylation sensitive restriction enzyme sequencing (MRE-seq). We reported recurrent methylation changes in the promoters of several genes, many previously implicated in cancer, including FAM83A and SEPT9 (hypomethylation), as well as PCDH7, NKX2-1, and SOX17 (hypermethylation). Although many methylation changes between tumors and their paired normal samples were shared across patients, several were specific to a particular smoking status. For example, never-smokers displayed a greater proportion of hypomethylated differentially methylated regions (hypoDMRs) and a greater number of recurrently hypomethylated promoters, including those of ASPSCR1, TOP2A, DPP9, and USP39, all previously linked to cancer. Changes outside of promoters were also widespread and often recurrent, particularly methylation loss over repetitive elements, highly enriched for ERV1 subfamilies. Recurrent hypoDMRs were enriched for several transcription factor binding motifs, often for genes involved in signaling and cell proliferation. For example, 71% of recurrent promoter hypoDMRs contained a motif for NKX2-1. Finally, the majority of DMRs were located within an active chromatin state in tissues profiled using the Roadmap Epigenomics data, suggesting that methylation changes may contribute to altered regulatory programs through the adaptation of cell type-specific expression programs.

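Research question: Epigenetic changes, such as alterations in DNA methylation, are widespread in cancer and complement genetic mutations to influence cancer progression and treatment outcome. We studied the role of DNA methylation in tumor phenotype in patients with non-small cell lung cancer (NSCLC) and used patients with different smoking histories to assess the effect of smoking on DNA methylation.

Methods: To determine the potential contribution of DNA methylation alterations to tumor phenotype in smoker and never-smoker NSCLC patients, we performed genome-wide DNA methylation profiling of 17 primary NSCLC tumors and 10 matched normal lung samples using two complementary assays, methylated DNA immunoprecipitation sequencing (MeDIP-seq) and methylation-sensitive restriction enzyme sequencing (MRE-seq).

Main results: We found recurrent methylation changes in the promoters of multiple genes, many previously reported to be associated with cancer, including FAM83A and SEPT9 (hypomethylation), as well as PCDH7, NKX2-1, and SOX17 (hypermethylation). Although many methylation changes between tumors and their paired normal samples were shared across patients, some were specific to a particular smoking history. For example, never-smokers showed a greater proportion of hypomethylated differentially methylated regions (hypoDMRs) and more recurrently hypomethylated promoters, including those of ASPSCR1, TOP2A, DPP9, and USP39, all previously reported to be associated with cancer. Changes outside promoters were also widespread and often recurrent, particularly methylation loss over repetitive elements, which was highly enriched for ERV1 subfamilies. Recurrent hypoDMRs were enriched for several transcription factor binding motifs, often for genes involved in signal transduction and cell proliferation. For example, 71% of recurrent promoter hypoDMRs contained a motif for NKX2-1. Finally, in tissues analyzed with the Roadmap reference epigenome data, most DMRs were located in active chromatin states, suggesting that methylation changes may alter gene regulatory programs through the adaptation of cell type-specific expression programs.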

Page 991-1013

Original Research

Differential Transcriptomic Landscapes of SARS-CoV-2 Variants in Multiple Organs from Infected Rhesus Macaques

Tingfu Du, Chunchun Gao, Shuaiyao Lu, Qianlan Liu, Yun Yang, Wenhai Yu, Wenjie Li, Yong Qiao Sun, Cong Tang, Junbin Wang, Jiahong Gao, Yong Zhang, Fangyu Luo, Ying Yang, Yun-Gui Yang, Xiaozhong Peng

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused the persistent coronavirus disease 2019 (COVID-19) pandemic, which has resulted in millions of deaths worldwide and brought an enormous public health and global economic burden. The recurring global wave of infections has been exacerbated by growing variants of SARS-CoV-2. In this study, the virological characteristics of the original SARS-CoV-2 strain and its variants of concern (VOCs; including Alpha, Beta, and Delta) in vitro, as well as differential transcriptomic landscapes in multiple organs (lung, right ventricle, blood, cerebral cortex, and cerebellum) from the infected rhesus macaques, were elucidated. The original strain of SARS-CoV-2 caused a stronger innate immune response in host cells, and its VOCs markedly increased the levels of subgenomic RNAs, such as N, Orf9b, Orf6, and Orf7ab, which are known as the innate immune antagonists and the inhibitors of antiviral factors. Intriguingly, the original SARS-CoV-2 strain and Alpha variant induced larger alteration of RNA abundance in tissues of rhesus monkeys than Beta and Delta variants did. Moreover, a hyperinflammatory state and active immune response were shown in the right ventricles of rhesus monkeys by the up-regulation of inflammation- and immune-related RNAs. Furthermore, peripheral blood may mediate signaling transmission among tissues to coordinate the molecular changes in the infected individuals. Collectively, these data provide insights into the pathogenesis of COVID-19 at the early stage of infection by the original SARS-CoV-2 strain and its VOCs.

Page 1014-1029

Original Research

MicroRNA–disease Network Analysis Repurposes Methotrexate for the Treatment of Abdominal Aortic Aneurysm in Mice

Yicong Shen, Yuanxu Gao, Jiangcheng Shi, Zhou Huang, Rongbo Dai, Yi Fu, Yuan Zhou, Wei Kong, Qinghua Cui

Abdominal aortic aneurysm (AAA) is a permanent dilatation of the abdominal aorta and is highly lethal. The main purpose of the current study is to search for noninvasive medical therapies for AAA, for which there is currently no effective drug therapy. Network medicine represents a cutting-edge technology, as analysis and modeling of disease networks can provide critical clues regarding the etiology of specific diseases and therapeutics that may be effective. Here, we proposed a novel algorithm to quantify disease relations based on a large accumulated microRNA–disease association dataset and then built a disease network covering 15 disease classes and 304 diseases. Analysis revealed some patterns for these diseases. For instance, diseases tended to be clustered and coherent in the network. Surprisingly, we found that AAA showed the strongest similarity with rheumatoid arthritis and systemic lupus erythematosus, both of which are autoimmune diseases, suggesting that AAA may be etiologically a type of autoimmune disease. Based on this observation, we further hypothesized that drugs for autoimmune diseases could be repurposed for the prevention and therapy of AAA. Finally, animal experiments confirmed that methotrexate, a drug for autoimmune diseases, was able to alleviate the formation and development of AAA.

Page 1030-1042

Method

AB-Gen: Antibody Library Design with Generative Pre-trained Transformer and Deep Reinforcement Learning

Xiaopeng Xu, Tiantian Xu, Juexiao Zhou, Xingyu Liao, Ruochi Zhang, Yu Wang, Lu Zhang, Xin Gao

Antibody leads must fulfill multiple desirable properties to be clinical candidates. Primarily due to the low throughput of the experimental procedure, the need for such multi-property optimization creates a bottleneck in preclinical antibody discovery and development, because addressing one issue usually causes another. We developed a reinforcement learning (RL) method, named AB-Gen, for antibody library design using a generative pre-trained transformer (GPT) as the policy network of the RL agent. We showed that this model can learn the antibody space of heavy chain complementarity determining region 3 (CDRH3) and generate sequences with similar property distributions. In addition, when using human epidermal growth factor receptor-2 (HER2) as the target, the agent model of AB-Gen was able to generate novel CDRH3 sequences that fulfill multi-property constraints. In total, 509 generated sequences passed all property filters, and three highly conserved residues were identified. The importance of these residues was further demonstrated by molecular dynamics simulations, confirming that the agent model was capable of grasping important information in this complex optimization task. Overall, the AB-Gen method is able to design novel antibody sequences with a higher success rate than the traditional propose-then-filter approach. It has the potential to be used in practical antibody design, thus empowering the antibody discovery and development process. The source code of AB-Gen is freely available at Zenodo (https://doi.org/10.5281/zenodo.7657016) and BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT007341).

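Antibodies specifically recognize and bind antigens and have broad clinical applications. At present, more than one hundred antibody drugs are in clinical trials or already on the market for treating cancer, autoimmune diseases, infectious diseases, and other conditions. However, developing highly efficacious antibody drugs requires multi-property optimization of antibody sequences, including specificity, affinity, solubility, viscosity, expression level, and immunogenicity. This optimization is usually time-consuming, costly, and has a low success rate, making it a bottleneck in antibody drug development. To address this problem, AB-Gen adopts a GPT-based reinforcement learning framework that can generate novel CDRH3 sequences satisfying multiple property constraints.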

Page 1043-1053

Database

Database Commons: A Catalog of Worldwide Biological Databases

Lina Ma, Dong Zou, Lin Liu, Huma Shireen, Amir A. Abbasi, Alex Bateman, Jingfa Xiao, Wenming Zhao, Yiming Bao, Zhang Zhang

Biological databases serve as a fundamental global infrastructure for the worldwide scientific community, dramatically aiding the transformation of big data into knowledge discovery and driving significant innovations in a wide range of research fields. Given the rapid data production, biological databases continue to increase in size and importance. To build a catalog of worldwide biological databases, we curate a total of 5825 biological databases from 8931 publications, which are geographically distributed in 72 countries/regions and developed by 1975 institutions (as of September 20, 2022). We further devise a z-index, a novel index to characterize the scientific impact of a database, and rank all these biological databases as well as their hosting institutions and countries in terms of citation and z-index. Consequently, we present a series of statistics and trends of worldwide biological databases, yielding a global perspective to better understand their status and impact for life and health sciences. An up-to-date catalog of worldwide biological databases, as well as their curated meta-information and derived statistics, is publicly available at Database Commons (https://ngdc.cncb.ac.cn/databasecommons/).

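Biological databases are an important foundation for research in the life sciences and related disciplines: they provide basic data resources for scientific research, transform the way life science research is done, and promote big data-driven scientific discovery and innovation. With the surge in life science data, countries around the world have been increasing their investment in biological database resources, and the number, size, and importance of biological databases continue to grow. However, a comprehensive worldwide survey of biological databases has long been lacking, so there has been no overview of global development trends and no platform for standardized integration and evaluation of information on biological databases worldwide. To this end, the research team established Database Commons, a catalog of worldwide biological databases; built a classification standard and a structured information curation model for biological databases; developed multiple evaluation methods; developed a back-end curation system that can be updated in real time; and, together with multiple research institutions in China and abroad, continues to curate information on biological databases worldwide.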

Page 1054-1058

Database

OBIA: An Open Biomedical Imaging Archive

Enhui Jin, Dongli Zhao, Gangao Wu, Junwei Zhu, Zhonghuang Wang, Zhiyao Wei, Sisi Zhang, Anke Wang, Bixia Tang, Xu Chen, Yanling Sun, Zhe Zhang, Wenming Zhao, Yuanguang Meng

With the development of artificial intelligence (AI) technologies, biomedical imaging data play an important role in scientific research and clinical application, but the available resources are limited. Here we present Open Biomedical Imaging Archive (OBIA), a repository for archiving biomedical imaging and related clinical data. OBIA adopts five data objects (Collection, Individual, Study, Series, and Image) for data organization, and accepts the submission of biomedical images of multiple modalities, organs, and diseases. In order to protect personal privacy, OBIA has formulated a unified de-identification and quality control process. In addition, OBIA provides friendly and intuitive web interfaces for data submission, browsing, and retrieval, as well as image retrieval. As of September 2023, OBIA has housed data for a total of 937 individuals, 4136 studies, 24,701 series, and 1,938,309 images covering 9 modalities and 30 anatomical sites. Collectively, OBIA provides a reliable platform for biomedical imaging data management and offers free open access to all publicly available data to support research activities throughout the world. OBIA can be accessed at https://ngdc.cncb.ac.cn/obia.

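Research question: Biomedical imaging data play an important role in both scientific research and clinical application, but available resources are limited. Medical imaging data contain a large amount of private information; how to build a biomedical imaging data management platform that both protects the privacy of the data and promotes global data sharing is a current problem in the use of biomedical imaging data.

Methods: In this study, OBIA organizes data through five data objects (Collection, Individual, Study, Series, and Image) and accepts submissions of images of multiple modalities, organs, and diseases. To protect personal privacy, OBIA has established a unified de-identification and quality control process. In addition, OBIA provides friendly and intuitive web interfaces for data submission, browsing, and retrieval of metadata and images.

Main result 1: Metadata are collected through an umbrella structure of Collection, Individual, Study, Series, and Image and are linked to Individual in GSA-Human, facilitating the use of multi-omics data.

Main result 2: A unified de-identification and quality control process was established that protects private information in imaging data while also extracting and recording key information in a standardized way.

Main result 3: Friendly and intuitive web interfaces are provided for data submission, browsing, and retrieval of metadata and images, so users can easily submit, browse, and search data.

OBIA is accessible at https://ngdc.cncb.ac.cn/obia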
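
The five data objects named in this abstract can be pictured as a nested hierarchy. The sketch below is only an illustration under stated assumptions: the abstract names the objects but does not spell out how they nest, so the Collection > Individual > Study > Series > Image ordering, every field name, and the helper `build_demo_collection` are hypothetical and are not OBIA's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Image:
    image_id: str  # hypothetical identifier for a single image


@dataclass
class Series:
    series_id: str
    modality: str  # e.g. "CT"; field name is an assumption
    images: List[Image] = field(default_factory=list)


@dataclass
class Study:
    study_id: str
    series: List[Series] = field(default_factory=list)


@dataclass
class Individual:
    individual_id: str  # assumed to be a de-identified ID
    studies: List[Study] = field(default_factory=list)


@dataclass
class Collection:
    title: str
    individuals: List[Individual] = field(default_factory=list)

    def count_images(self) -> int:
        """Walk the assumed hierarchy and count every image."""
        return sum(
            len(series.images)
            for individual in self.individuals
            for study in individual.studies
            for series in study.series
        )


def build_demo_collection() -> Collection:
    """Build a tiny made-up collection: 1 individual, 1 study, 2 series."""
    ct = Series("SER-1", "CT", [Image("IMG-1"), Image("IMG-2")])
    mr = Series("SER-2", "MR", [Image("IMG-3")])
    study = Study("STU-1", [ct, mr])
    person = Individual("IND-1", [study])
    return Collection("Demo collection", [person])


if __name__ == "__main__":
    demo = build_demo_collection()
    print(demo.title, "contains", demo.count_images(), "images")
```

Nesting the objects this way makes summary counts of individuals, studies, series, and images a simple traversal; a real archive would more likely store the same relations as database tables with foreign keys.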

Page 1059-1065

Database

RCoV19: A One-stop Hub for SARS-CoV-2 Genome Data Integration, Variant Monitoring, and Risk Pre-warning

Cuiping Li, Lina Ma, Dong Zou, Rongqin Zhang, Xue Bai, Lun Li, Gangao Wu, Tianhao Huang, Wei Zhao, Enhui Jin, Yiming Bao, Shuhui Song

The Resource for Coronavirus 2019 (RCoV19) is an open-access information resource dedicated to providing valuable data on the genomes, mutations, and variants of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In this updated implementation of RCoV19, we have made significant improvements and advancements over the previous version. Firstly, we have implemented a highly refined genome data curation model. This model now features an automated integration pipeline and optimized curation rules, enabling efficient daily updates of data in RCoV19. Secondly, we have developed a global and regional lineage evolution monitoring platform, alongside an outbreak risk pre-warning system. These additions provide a comprehensive understanding of SARS-CoV-2 evolution and transmission patterns, enabling better preparedness and response strategies. Thirdly, we have developed a powerful interactive mutation spectrum comparison module. This module allows users to compare and analyze mutation patterns, assisting in the detection of potential new lineages. Furthermore, we have incorporated a comprehensive knowledgebase on mutation effects. This knowledgebase serves as a valuable resource for retrieving information on the functional implications of specific mutations. In summary, RCoV19 serves as a vital scientific resource, providing access to valuable data, relevant information, and technical support in the global fight against COVID-19. The complete contents of RCoV19 are available to the public at https://ngdc.cncb.ac.cn/ncov/.

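Research question: Coronavirus disease 2019 (COVID-19) is the most widespread and most impactful epidemic in nearly a century, and the number of SARS-CoV-2 genome sequences far exceeds the total for all other known viruses combined. This massive volume of SARS-CoV-2 genome sequences poses unprecedented challenges for rapid data integration, analysis, and mining. The COVID-19 epidemic is still spreading worldwide, and the SARS-CoV-2 genome continues to mutate and evolve. Developing and establishing methods and platforms for automated integration of large-scale SARS-CoV-2 genome data, real-time monitoring, and early warning of high-risk lineages is therefore of significant practical value and scientific importance for advancing global SARS-CoV-2 research and building public health security systems.

Methods: By developing a fully automated intelligent data curation model and data-sharing pages, this study carries out automated collection, de-duplication, cross-referencing, and quality assessment of global SARS-CoV-2 genome data, achieving efficient daily updates. Based on the massive data integrated in the database, we established a rapid genome variant analysis pipeline, a haplotype network evolution construction algorithm, and a machine learning-based early-warning model for high-risk lineages, and developed a real-time monitoring platform for SARS-CoV-2 transmission and evolution, a visualization system for early warning of high-risk variants, and an interactive module for rapid mutation spectrum comparison. Together these enable visual, dynamic monitoring of SARS-CoV-2 genome sequences, variants, and evolutionary lineages, early warning of high-risk variant lineages, and analysis of the variation patterns of important sequences or lineages. In addition, by manually curating knowledge of the effects of mutations in the SARS-CoV-2 genome, covering infectivity/transmissibility, antibody resistance, drug resistance, and T-cell epitopes, we upgraded the SARS-CoV-2 variation knowledgebase, helping researchers and policymakers involved in epidemic control to better understand the variation characteristics of SARS-CoV-2 and providing an important reference for scientific research and control decisions.

Main results

Main result 1: Developed a one-stop automated SARS-CoV-2 genome data curation model and data-sharing pages, carrying out automated collection, de-duplication, cross-referencing, and quality assessment of global SARS-CoV-2 genome data, and continuously providing real-time, comprehensive SARS-CoV-2 genome metadata, global distribution and statistics, and advanced search services.

Main result 2: Established a rapid genome variant analysis pipeline, a haplotype network evolution construction algorithm, and a machine learning-based early-warning model for high-risk lineages, and developed a real-time monitoring platform for SARS-CoV-2 transmission and evolution, a visualization system for early warning of high-risk variants, and an interactive module for rapid mutation spectrum comparison. These enable visual, dynamic monitoring of SARS-CoV-2 genome sequences, variants, and evolutionary lineages, early warning of high-risk variant lineages, and analysis of the variation patterns of important sequences or lineages. These methods and platforms provide important technical and data support for public health security responses driven by pathogen genomic big data.

Main result 3: Manually curated knowledge of the effects of mutations in the SARS-CoV-2 genome, covering multidimensional information such as infectivity/transmissibility, antibody resistance, drug resistance, and T-cell epitopes. On this basis, the SARS-CoV-2 variation knowledgebase was upgraded, helping researchers and policymakers involved in epidemic control to better understand the variation characteristics of SARS-CoV-2 and providing an important reference for scientific research and control decisions.

About the database: The 2019 Novel Coronavirus Resource, RCoV19 (formerly 2019-nCoVR), was released on January 22, 2020, and was the world's first publicly released comprehensive SARS-CoV-2 information resource. It dynamically integrates global SARS-CoV-2 genome sequences and metadata, supports the submission, storage, and sharing of global SARS-CoV-2 genome data, and provides mutation annotation and evolutionary lineage information; it has grown into the largest and most resource-rich public platform for SARS-CoV-2 research internationally. Over the past few years, it has provided data services to more than 3.75 million visitors from 182 countries and regions, with cumulative data downloads exceeding 18.4 billion records. In summary, RCoV19 is a very important open, shared scientific resource in SARS-CoV-2 research, providing important data, platform, and technical support for the global fight against COVID-19.

RCoV19 is accessible at https://ngdc.cncb.ac.cn/ncov/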

Page 1066-1079
