Volume: 19, Issue: 4

Preview

From Reads to Insights: Integrative Pipelines for Biological Interpretation of ATAC-seq Data

Ya Cui, Jason Sheng Li, Wei Li

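ATAC-seq is widely used to map open chromatin regions genome-wide. It exploits the fact that the Tn5 transposase can insert only into open chromatin, offering a simple, fast approach to profiling chromatin accessibility that requires few input cells. In recent years, ATAC-seq data have accumulated exponentially, and large international projects such as TCGA and CommonMind have even generated ATAC-seq data for large cohorts of human samples. How to analyze ATAC-seq data comprehensively and in depth has become a question researchers cannot avoid. At present, however, researchers analyze ATAC-seq data directly with software designed for ChIP-seq or DNase-seq data, which is not fully suited to the task, and a comprehensive analysis workflow for ATAC-seq data has not yet been fully established. In the latest issue of GPB, Liu et al. and Qiu et al. developed integrative analysis pipelines for ATAC-seq data, AIAP and CoBRA respectively, which provide a reference for comprehensive ATAC-seq analysis. This article introduces these two pipelines.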

Page 519-521


Research Article

SPA: A Quantitation Strategy for MS Data in Patient-derived Xenograft Models

Xi Cheng, Lili Qian, Bo Wang, Minjia Tan, Jing Li

With the development of mass spectrometry (MS)-based proteomics technologies, the patient-derived xenograft (PDX), which is generated from the primary tumor of a patient, is widely used for proteome-wide analysis of cancer mechanisms and for identification of drug biomarkers. However, interpreting the proteomics data remains challenging because the PDX sample is a cross-species mixture of human cancerous tissues and immunodeficient mouse tissues, which requires complex data deconvolution. In this study, by using lab-assembled mixtures of human and mouse cells with different mixing ratios as a benchmark, we developed and evaluated a new method, SPA (shared peptide allocation), for protein quantitation by considering the unique and shared peptides of both species. The results showed that SPA could provide more convenient and accurate protein quantitation in human–mouse mixed samples. Further validation on a pair of gastric PDX samples (one bearing FGFR2 amplification and the other not) showed that our new method not only significantly improved the overall protein identification, but also was the only one to detect the differential phosphorylation of FGFR2 and its downstream mediators (such as RAS and ERK). The tool pdxSPA is freely available at https://github.com/Li-Lab-Proteomics/pdxSPA.

Page 522-533
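
As an illustration of what allocating shared peptides can mean in a human–mouse mixture, the following minimal Python sketch splits the intensity of peptides shared by both species in proportion to each species' unique-peptide intensity. The proportional rule, the function name allocate_shared, and the toy intensities are assumptions made for illustration; they are not taken from the SPA abstract or from the pdxSPA code.

```python
def allocate_shared(unique_human, unique_mouse, shared):
    """Split shared-peptide intensity between the human and mouse
    forms of one protein.

    Assumption (hypothetical rule, not the published SPA algorithm):
    the shared signal is divided in proportion to the summed
    unique-peptide intensity observed for each species.
    """
    h = sum(unique_human)   # intensity of human-only peptides
    m = sum(unique_mouse)   # intensity of mouse-only peptides
    s = sum(shared)         # intensity of peptides found in both species
    total = h + m
    if total == 0:
        # No unique evidence for either species: fall back to an even split.
        return s / 2, s / 2
    return h + s * h / total, m + s * m / total


# Toy example: 80% of the unique signal is human, so 80% of the
# shared signal is assigned to the human protein.
human, mouse = allocate_shared([30.0, 10.0], [10.0], [50.0])
```

With these toy values the human protein receives 80.0 and the mouse protein 20.0 intensity units; the total signal is conserved by construction.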


Method

RePhine: An Integrative Method for Identification of Drug Response-related Transcriptional Regulators

Xujun Wang, Zhengtao Zhang, Wenyi Qin, Shiyi Liu, Cong Liu, Georgi Z. Genchev, Lijian Hui, Hongyu Zhao, Hui Lu

Transcriptional regulators (TRs) participate in essential processes in cancer pathogenesis and are critical therapeutic targets. Identification of drug response-related TRs from cell line-based compound screening data is often challenging due to low mRNA abundance of TRs, protein modifications, and other confounders (CFs). In this study, we developed a regression-based pharmacogenomic and ChIP-seq data integration method (RePhine) to infer the impact of TRs on drug response through integrative analyses of pharmacogenomic and ChIP-seq data. RePhine was evaluated in simulation and pharmacogenomic data and was applied to pan-cancer datasets with the goal of biological discovery. In simulation data with added noises or CFs and in pharmacogenomic data, RePhine demonstrated an improved performance in comparison with three commonly used methods (including Pearson correlation analysis, logistic regression model, and gene set enrichment analysis). Utilizing RePhine and Cancer Cell Line Encyclopedia data, we observed that RePhine-derived TR signatures could effectively cluster drugs with different mechanisms of action. RePhine predicted that loss-of-function of EZH2/PRC2 reduces cancer cell sensitivity toward the BRAF inhibitor PLX4720. Experimental validation confirmed that pharmacological EZH2 inhibition increases the resistance of cancer cells to PLX4720 treatment. Our results support that RePhine is a useful tool for inferring drug response-related TRs and for potential therapeutic applications. The source code for RePhine is freely available at https://github.com/coexps/RePhine.

Page 534-548


Method

NOGEA: A Network-oriented Gene Entropy Approach for Dissecting Disease Comorbidity and Drug Repositioning

Zihu Guo, Yingxue Fu, Chao Huang, Chunli Zheng, Ziyin Wu, Xuetong Chen, Shuo Gao, Yaohua Ma, Mohamed Shahen, Yan Li, Pengfei Tu, Jingbo Zhu, Zhenzhong Wang, Wei Xiao, Yonghua Wang

Rapid development of high-throughput technologies has permitted the identification of an increasing number of disease-associated genes (DAGs), which are important for understanding disease initiation and developing precision therapeutics. However, DAGs often contain large amounts of redundant or false positive information, leading to difficulties in quantifying and prioritizing potential relationships between these DAGs and human diseases. In this study, a network-oriented gene entropy approach (NOGEA) is proposed for accurately inferring master genes that contribute to specific diseases by quantitatively calculating their perturbation abilities on directed disease-specific gene networks. In addition, we confirmed that the master genes identified by NOGEA have a high reliability for predicting disease-specific initiation events and progression risk. Master genes may also be used to extract the underlying information of different diseases, thus revealing mechanisms of disease comorbidity. More importantly, approved therapeutic targets are topologically localized in a small neighborhood of master genes in the interactome network, which provides a new way for predicting drug-disease associations. Through this method, 11 old drugs were newly identified and predicted to be effective for treating pancreatic cancer and then validated by in vitro experiments. Collectively, the NOGEA was useful for identifying master genes that control disease initiation and co-occurrence, thus providing a valuable strategy for drug efficacy screening and repositioning. NOGEA codes are publicly available at https://github.com/guozihuaa/NOGEA.

Page 549-564


Method

DeepCAPE: A Deep Convolutional Neural Network for the Accurate Prediction of Enhancers

Shengquan Chen, Mingxin Gan, Hairong Lv, Rui Jiang

The establishment of a landscape of enhancers across human cells is crucial to deciphering the mechanism of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, which contain successfully reported enhancers in typical cell lines, are still too costly and time-consuming to perform systematic identification of enhancers specific to different cell lines. Existing computational methods, capable of predicting regulatory elements purely relying on DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that chromatin accessibility of a DNA segment is closely related to its potential function in regulation, and thus may provide useful information in identifying regulatory elements. Motivated by the aforementioned understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We proposed DeepCAPE, a deep convolutional neural network to predict enhancers via the integration of DNA sequences and DNase-seq data. Benefitting from the well-designed feature extraction mechanism and skip connection strategy, our model not only consistently outperforms existing methods in the imbalanced classification of cell line-specific enhancers against background sequences, but also has the ability to self-adapt to different sizes of datasets. Besides, with the adoption of auto-encoder, our model is capable of making cross-cell line predictions. We further visualize kernels of the first convolutional layer and show the match of identified sequence signatures and known motifs. We finally demonstrate the potential ability of our model to explain functional implications of putative disease-associated genetic variants and discriminate disease-related enhancers. The source code and detailed tutorial of DeepCAPE are freely available at https://github.com/ShengquanChen/DeepCAPE.
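Systematically characterizing enhancers in the human genome is crucial for deciphering gene regulatory mechanisms, cell differentiation, and disease development. Although high-throughput sequencing technologies can effectively identify enhancers in specific cell lines, these experiments are time-consuming and costly. Existing computational methods can predict enhancers to some extent, but because they use only DNA sequence information, they cannot capture the cell line specificity of enhancers. Recent studies indicate that the chromatin accessibility of a DNA segment is highly correlated with its potential regulatory function and can help identify regulatory elements. We therefore built DeepCAPE, a deep convolutional neural network that integrates DNA sequence information with chromatin accessibility data to accurately predict enhancers in different cell lines. Compared with existing methods, DeepCAPE not only predicts cell line-specific enhancers better in highly imbalanced data, but also adapts automatically to datasets of different sizes while maintaining good performance. In addition, our method uses an auto-encoder to predict enhancers across cell lines. The sequence features learned by the convolutional kernels also match known motifs. Finally, we illustrate with examples that DeepCAPE can be used effectively to interpret disease-associated enhancers and genetic variants.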

Page 565-577


Database

The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types

Tingting Chen, Xu Chen, Sisi Zhang, Junwei Zhu, Bixia Tang, Anke Wang, Lili Dong, Zhewen Zhang, Caixia Yu, Yanling Sun, Lianjiang Chi, Huanxin Chen, Shuang Zhai, Yubin Sun, Li Lan, Xin Zhang, Jingfa Xiao, Yiming Bao, Yanqing Wang, Zhang Zhang, Wenming Zhao

The Genome Sequence Archive (GSA) is a data repository for archiving raw sequence data, which provides data storage and sharing services for worldwide scientific communities. Considering explosive data growth with diverse data types, here we present the GSA family by expanding into a set of resources for raw data archive with different purposes, namely, GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the 2017 version, GSA has been significantly updated in data model, online functionalities, and web interfaces. GSA-Human, as a new partner of GSA, is a data repository specialized in human genetics-related data with controlled access and security. OMIX, as a critical complement to the two resources mentioned above, is an open archive for miscellaneous data. Together, all these resources form a family of resources dedicated to archiving explosive data with diverse types, accepting data submissions from all over the world, and providing free open access to all publicly available data in support of worldwide research activities.
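The Genome Sequence Archive (GSA) is a public-interest repository for managing raw omics sequencing data, aiming to promote the sharing and use of omics data worldwide. In recent years, with the explosive growth of omics data, the diversification of data types, and the special requirements for managing human genetic resource data, we have updated and expanded GSA into the GSA family, comprising GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the version released in 2017, GSA has been updated in its data model, system functionality, and data submission procedures. GSA-Human is a database dedicated to storing human genetic resource data; it provides controlled access to these data and safeguards their security. OMIX is an archive for data other than raw sequencing data, such as environmental, phenomic, and metabolomic data; as an important complement to the two resources above, it meets users' need to deposit data types other than raw sequencing data. The resources of the GSA family are dedicated to collecting and managing data of all types, accept submissions from researchers worldwide, and provide free open access to all publicly available data in support of life science research around the world.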

Page 578-583


Database

Genome Warehouse: A Public Repository Housing Genome-scale Data

Meili Chen, Yingke Ma, Song Wu, Xinchang Zheng, Hongen Kang, Jian Sang, Xingjian Xu, Lili Hao, Zhaohua Li, Zheng Gong, Jingfa Xiao, Zhang Zhang, Wenming Zhao, Yiming Bao

The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences with different assembly levels, as well as an update of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata of biological project, biological sample, and genome assembly, in addition to genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. By May 21, 2021, GWH has received 19,124 direct submissions covering a diversity of 1108 species and has released 8772 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
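The Genome Warehouse (GWH) is a publicly accessible repository that stores genome assembly data for multiple species and provides a series of web services for genome data submission, storage, release, and sharing. As a core resource of the China National Center for Bioinformation (CNCB) and the National Genomics Data Center (NGDC), GWH accepts submissions of complete and partial (chloroplast, mitochondrial, and plasmid) genome sequences at different assembly levels, as well as updates to existing genome assemblies. For each assembly, in addition to the genome sequence and annotation, GWH collects detailed genome-related metadata (covering the biological project, biological sample, and genome assembly). GWH is equipped with a uniform, standardized quality control procedure for archiving high-quality sequences and annotations. Besides basic browse and search functions, released genome sequences and annotations can be visualized with JBrowse. As of May 21, 2021, GWH had received 19,124 directly submitted genome assemblies covering 1108 species and had released 8772 of them. In summary, GWH is an important resource for managing large-scale genome data and provides free, publicly accessible genome data to researchers worldwide. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.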

Page 584-589


Database

REVA as A Well-curated Database for Human Expression-modulating Variants

Yu Wang, Fang-Yuan Shi, Yu Liang, Ge Gao

More than 90% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database for over 11.8 million experimentally tested noncoding variants with expression-modulating potentials. We provided 2424 functional annotations that could be used to pinpoint the plausible regulatory mechanism of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality experimentally tested expression-modulating variants with extensive functional annotations, which will be useful for users in the noncoding variant community. REVA is freely available at http://reva.gao-lab.org.
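Research question: Collection and integration of experimentally validated expression-modulating noncoding variants, their functional annotation, and evaluation of related prediction tools. Background: Although 97% of the human genome does not encode proteins, these regions still have functions that cannot be ignored, and more than 90% of known disease- and trait-associated variants lie in noncoding regions. To find these functional noncoding variants effectively, a number of computational tools have been developed in recent years, but their performance remains to be further evaluated. By collecting and integrating data from multiple large-scale experiments that test expression-modulating variants, we built REVA, a high-confidence database of expression-modulating variants. REVA currently holds more than 11.8 million noncoding variants with the potential to modulate gene expression from 18 cell lines, and applies convolutional neural network (CNN) models to annotate the variants functionally from multiple angles, providing an important basis and clues for understanding the functions and mechanisms of these noncoding variants. In addition, using the collected high-quality variant dataset, we evaluated 7 mainstream noncoding variant prediction tools and found that their sensitivity still needs improvement; false negatives produced by existing tools in large-scale analyses are an issue requiring urgent attention. Key result 1: By collecting and integrating data, we built the largest database of experimentally validated human expression-modulating noncoding variants and, using convolutional neural networks, provided 2424 functional annotations for the variants, offering clues to the biological mechanisms through which the variants act. Key result 2: Using the collected high-quality data, we evaluated mainstream tools for predicting noncoding variant function and found that the sensitivity of all tested tools was relatively low, an important problem that urgently needs to be solved. Database link: http://reva.gao-lab.org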

Page 590-601


Database

SmProt: A Reliable Repository with Comprehensive Annotation of Small Proteins Identified from Ribosome Profiling

Yanyan Li, Honghong Zhou, Xiaomin Chen, Yu Zheng, Quan Kang, Di Hao, Lili Zhang, Tingrui Song, Huaxia Luo, Yajing Hao, Runsheng Chen, Peng Zhang, Shunmin He

Small proteins specifically refer to proteins consisting of fewer than 100 amino acids translated from small open reading frames (sORFs), which were usually missed in previous genome annotation. The significance of small proteins has been revealed in recent years, along with the discovery of their diverse functions. However, systematic annotation of small proteins is still insufficient. SmProt was specially developed to provide valuable information on small proteins for the scientific community. Here we present the update of SmProt, which emphasizes reliability of translated sORFs, genetic variants in translated sORFs, disease-specific sORF translation events or sequences, and remarkably increased data volume. More components such as non-ATG translation initiation, function, and new sources are also included. SmProt incorporated 638,958 unique small proteins curated from 3,165,229 primary records, which were computationally predicted from 419 ribosome profiling (Ribo-seq) datasets or collected from literature and other sources from 370 cell lines or tissues in 8 species (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, and Escherichia coli). In addition, small protein families identified from human microbiomes were also collected. All datasets in SmProt are free to access, and available for browse, search, and bulk downloads at http://bigdata.ibp.ac.cn/SmProt/.
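Research question: Construction of the Small Protein Repository (SmProt). Background: Small open reading frames (sORFs) are widespread in the genomes of humans and many other organisms. They can be located in coding regions, untranslated regions (UTRs) of mRNAs, and various non-coding RNAs (ncRNAs), and some can be translated into small proteins. Studies have found that small proteins function in a variety of biological processes and are associated with many diseases, yet they were usually overlooked in previous genome annotation. SmProt aims to integrate small protein information from multiple sources, especially small proteins translated from lncRNAs and UTRs. Based on extensive collection, strict quality control, and reanalysis of public ribosome profiling (Ribo-seq) data, mining of published literature and databases, cross-integration of multi-source information, merging and redundancy removal of results, and restructuring of the data framework, the authors released a new SmProt that provides more systematic, richer, and more accurate small protein annotation, with efficient online browsing, search, visualization, BLAST, and download of all related information. Key result 1: SmProt contains 638,958 non-redundant small proteins derived from 3,165,229 sORF translation event records from more than 300 tissues/cell lines in 8 species, with an emphasis on the reliability of the identified translation events. Key result 2: It provides rich annotation of small proteins, including source, molecular features, coding potential of genomic regions, translation initiation, translation level, functional domain annotation, and multi-level translation evidence, and identifies a large number of small proteins encoded by ncRNAs and UTRs, serving as a reference for functional genomics research. Key result 3: It provides variants related to small proteins and their effects, as well as disease association information, serving as a reference for clinical medical research. Database link: http://bigdata.ibp.ac.cn/SmProt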

Page 602-610


Database

OGP: A Repository of Experimentally Characterized O-glycoproteins to Facilitate Studies on O-glycosylation

Jiangming Huang, Mengxi Wu, Yang Zhang, Siyuan Kong, Mingqi Liu, Biyun Jiang, Pengyuan Yang, Weiqian Cao

Numerous studies on cancers, biopharmaceuticals, and clinical trials have necessitated comprehensive and precise analysis of protein O-glycosylation. However, the lack of updated and convenient databases deters the storage of and reference to emerging O-glycoprotein data. To resolve this issue, an O-glycoprotein repository named OGP was established in this work. It was constructed with a collection of O-glycoprotein data from different sources. OGP contains 9354 O-glycosylation sites and 11,633 site-specific O-glycans mapping to 2133 O-glycoproteins, and it is the largest O-glycoprotein repository thus far. Based on the recorded O-glycosylation sites, an O-glycosylation site prediction tool was developed. Moreover, an OGP-based website is already available (https://www.oglyp.org/). The website comprises four specially designed and user-friendly modules: statistical analysis, database search, site prediction, and data submission. The first version of OGP repository and the website allow users to obtain various O-glycoprotein-related information, such as protein accession Nos., O-glycosylation sites, O-glycopeptide sequences, site-specific O-glycan structures, experimental methods, and potential O-glycosylation sites. O-glycosylation data mining can be performed efficiently on this website, which will greatly facilitate related studies. In addition, the database is accessible from OGP website (https://www.oglyp.org/download.php).
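Comprehensive and accurate analysis of protein O-glycosylation is essential in cancer, clinical, and biopharmaceutical research. However, the current lack of O-glycoprotein databases severely limits O-glycoprotein research. To address this problem, this study established an O-glycoprotein repository named OGP. OGP collects mass spectrometry-based O-glycoprotein data from different sources; it contains 9354 O-glycosylation sites, 2133 O-glycoproteins, and 11,633 site-specific O-glycans, and is the largest O-glycoprotein database to date. In addition, based on the O-glycosylation sites recorded in OGP, we developed an O-glycosylation site prediction tool and built an associated website (http://www.oglyp.org/). The website comprises four user-friendly modules: statistical analysis, database search, site prediction, and data submission. Through the website, users can retrieve various O-glycoprotein-related information, such as protein accession numbers, O-glycosylation sites, glycopeptide sequences, site-specific glycan structures, experimental methods, and references; predict potential glycosylation sites; and conveniently download or upload data. The website enables efficient O-glycosylation data mining and facilitates related research.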

Page 611-618


Application Note

rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study

Lilin Yin, Haohao Zhang, Zhenshuang Tang, Jingya Xu, Dong Yin, Zhiwu Zhang, Xiaohui Yuan, Mengjin Zhu, Shuhong Zhao, Xinyun Li, Xiaolei Liu

Along with the development of high-throughput sequencing technologies, both sample size and SNP number are increasing rapidly in genome-wide association studies (GWAS), and the associated computation is more challenging than ever. Here, we present a memory-efficient, visualization-enhanced, and parallel-accelerated R package called “rMVP” to address the need for improved GWAS computation. rMVP can 1) effectively process large GWAS data, 2) rapidly evaluate population structure, 3) efficiently estimate variance components by Efficient Mixed-Model Association eXpedited (EMMAX), Factored Spectrally Transformed Linear Mixed Models (FaST-LMM), and Haseman-Elston (HE) regression algorithms, 4) implement parallel-accelerated association tests of markers using general linear model (GLM), mixed linear model (MLM), and fixed and random model circulating probability unification (FarmCPU) methods, 5) compute fast with a globally efficient design in the GWAS processes, and 6) generate various visualizations of GWAS-related information. Accelerated by block matrix multiplication strategy and multiple threads, the association test methods embedded in rMVP are significantly faster than PLINK, GEMMA, and FarmCPU_pkg. rMVP is freely available at https://github.com/xiaolei-lab/rMVP.
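With the development of high-throughput sequencing technologies, the population sizes and marker densities used in genome-wide association studies (GWAS) have grown rapidly, and the enormous amount of computation has created an unprecedented computational burden. To meet the computational challenges of the big-data era, this study developed rMVP, a GWAS tool that combines computational efficiency, memory savings, and rich visualization. The tool can 1) efficiently read and access large-scale data; 2) rapidly evaluate population genetic structure; 3) efficiently estimate variance components using the EMMAX, FaST-LMM, and HE regression algorithms; 4) run the GLM, MLM, and FarmCPU association models with parallel acceleration; 5) avoid repeated large-matrix operations through global optimization of the whole computational workflow; and 6) output high-quality GWAS-related figures of multiple types and formats. With a block matrix computation strategy and multi-threaded parallel computing, the association methods in rMVP run roughly 5-20 times faster than the corresponding methods in PLINK, GEMMA, and FarmCPU_pkg. rMVP can be installed free of charge from https://github.com/xiaolei-lab/rMVP.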

Page 619-628


Application Note

GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction

Jiabo Wang, Zhiwu Zhang

Genome-wide association study (GWAS) and genomic prediction/selection (GP/GS) are the two essential enterprises in genomic research. Due to the great magnitude and complexity of genomic and phenotypic data, analytical methods and their associated software packages are frequently advanced. GAPIT is a widely-used genomic association and prediction integrated tool as an R package. The first version was released to the public in 2012 with the implementation of the general linear model (GLM), mixed linear model (MLM), compressed MLM (CMLM), and genomic best linear unbiased prediction (gBLUP). The second version was released in 2016 with several new implementations, including enriched CMLM (ECMLM) and settlement of MLMs under progressively exclusive relationship (SUPER). All the GWAS methods are based on the single-locus test. For the first time, in the current release of GAPIT, version 3 implemented three multi-locus test methods, including multiple loci mixed model (MLMM), fixed and random model circulating probability unification (FarmCPU), and Bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK). Additionally, two GP/GS methods were implemented based on CMLM (named compressed BLUP; cBLUP) and SUPER (named SUPER BLUP; sBLUP). These new implementations not only boost statistical power for GWAS and prediction accuracy for GP/GS, but also improve computing speed and increase the capacity to analyze big genomic data. Here, we document the current upgrade of GAPIT by describing the selection of the recently developed methods, their implementations, and potential impact. All documents, including source code, user manual, demo data, and tutorials, are freely available at the GAPIT website (http://zzlab.net/GAPIT).
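Research question: Construction of the genome-wide association and prediction software GAPIT version 3. Background: GAPIT is free software built on the R platform that integrates a variety of new algorithms for genome-wide association study and genomic selection and outputs the related figures and tables. The genome-wide association methods include the general linear model (GLM); mixed linear model (MLM); compressed mixed linear model (CMLM); enriched compressed mixed linear model (ECMLM); FaST-LMM; FaST-LMM-Select; settlement of MLMs under progressively exclusive relationship (SUPER); multiple loci mixed model (MLMM); fixed and random model circulating probability unification (FarmCPU); and Bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK). The genomic selection methods include genomic best linear unbiased prediction (gBLUP), compressed BLUP (cBLUP), and SUPER BLUP (sBLUP). GAPIT has now reached its third version. The first two versions have been cited more than 1400 times in total, and since 2016 the GAPIT website has received more than 34,000 page visits, making GAPIT an important analysis tool worldwide for genome-wide association and prediction in human disease research and in animal and plant breeding. Key result 1: Integration of the latest genome-wide association algorithms (MLMM, FarmCPU, and BLINK) and genomic selection prediction algorithms (cBLUP and sBLUP). Key result 2: Through logical reorganization and code optimization, the various algorithms exist independently within GAPIT3, so users do not need to download, install, and maintain them separately. Key result 3: Interactive outputs, including interactive Manhattan plots, interactive quantile-quantile (QQ) plots, and interactive genomic selection prediction results. Software homepage and GitHub link: http://zzlab.net/GAPIT https://github.com/jiabowang/GAPIT3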

Page 629-640


Application Note

AIAP: A Quality Control and Integrative Analysis Package to Improve ATAC-seq Data Analysis

Shaopeng Liu, Daofeng Li, Cheng Lyu, Paul M. Gontarz, Benpeng Miao, Pamela A.F. Madden, Ting Wang, Bo Zhang

Assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) is a technique widely used to investigate genome-wide chromatin accessibility. The recently published Omni-ATAC-seq protocol substantially improves the signal/noise ratio and reduces the input cell number. High-quality data are critical to ensure accurate analysis. Several tools have been developed for assessing sequencing quality and insertion size distribution for ATAC-seq data; however, key quality control (QC) metrics have not yet been established to accurately determine the quality of ATAC-seq data. Here, we optimized the analysis strategy for ATAC-seq and defined a series of QC metrics for ATAC-seq data, including reads under peak ratio (RUPr), background (BG), promoter enrichment (ProEn), subsampling enrichment (SubEn), and other measurements. We incorporated these QC tests into our recently developed ATAC-seq Integrative Analysis Package (AIAP) to provide a complete ATAC-seq analysis system, including quality assurance, improved peak calling, and downstream differential analysis. We demonstrated a significant improvement of sensitivity (20%–60%) in both peak calling and differential analysis by processing paired-end ATAC-seq datasets using AIAP. AIAP is compiled into Docker/Singularity, and it can be executed by one command line to generate a comprehensive QC report. We used ENCODE ATAC-seq data to benchmark and generate QC recommendations, and developed qATACViewer for the user-friendly interaction with the QC report. The software, source code, and documentation of AIAP are freely available at https://github.com/Zhang-lab/ATAC-seq_QC_analysis.
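ATAC-seq is a high-throughput sequencing technique that uses the Tn5 transposase to study chromatin accessibility. As the technique has developed, it has gradually become the mainstream method for studying chromatin accessibility, making ATAC-seq data analysis software particularly necessary; yet existing ATAC-seq analysis tools cannot analyze ATAC-seq data systematically. To meet this need, we optimized commonly used ATAC-seq analysis approaches and, drawing on benchmarking, established a series of metrics for assessing ATAC-seq data quality, including reads under peak ratio (RUPr), background (BG), promoter enrichment (ProEn), and subsampling enrichment (SubEn). We integrated these metrics and the analysis workflow with existing tools and algorithms (such as BWA, MethylQA, MACS2, and DESeq2) to develop AIAP, software capable of systematic ATAC-seq data analysis. AIAP is packaged as a Docker/Singularity image; with a single command line, it can analyze any ATAC-seq dataset and generate a comprehensive quality assessment report. Further testing showed that AIAP improved the sensitivity of peak calling and of differentially accessible region detection for ATAC-seq data by 20%–60%. We used AIAP to analyze ENCODE ATAC-seq data and thereby established reference standards for ATAC-seq data quality assessment. All quality reports and reference standards can be compared visually using qATACViewer, software we developed. Finally, the AIAP code is open source on GitHub.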

Page 641-651
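
To make one of the QC metrics named in the AIAP abstract concrete, here is a minimal Python sketch of a reads-under-peak ratio, read here as the fraction of aligned reads that overlap at least one called peak. That reading, the function name reads_under_peak_ratio, the half-open (chrom, start, end) interval format, and the toy coordinates are assumptions for illustration; AIAP's own definition and implementation may differ.

```python
from bisect import bisect_right


def reads_under_peak_ratio(reads, peaks):
    """Fraction of reads overlapping at least one peak.

    reads, peaks: lists of (chrom, start, end) half-open intervals.
    Assumption: peaks on the same chromosome do not overlap each other
    (as is typical after peak merging).
    """
    # Group peaks by chromosome and sort them by start coordinate.
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    hits = 0
    for chrom, start, end in reads:
        intervals = by_chrom.get(chrom, [])
        # Last peak whose start lies before the read's end.
        i = bisect_right(intervals, (end - 1, float("inf"))) - 1
        # It overlaps the read if it also ends after the read's start.
        if i >= 0 and intervals[i][1] > start:
            hits += 1
    return hits / len(reads) if reads else 0.0


peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
reads = [
    ("chr1", 150, 180),  # inside the first peak
    ("chr1", 190, 210),  # overlaps the end of the first peak
    ("chr1", 200, 250),  # starts exactly where the first peak ends: no overlap
    ("chr2", 100, 150),  # chromosome without peaks
]
rupr = reads_under_peak_ratio(reads, peaks)
```

In this toy input two of the four reads fall under a peak, so the ratio is 0.5. A production pipeline would compute this from BAM and peak files rather than in-memory tuples.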


Application Note

CoBRA: Containerized Bioinformatics Workflow for Reproducible ChIP/ATAC-seq Analysis

Xintao Qiu, Avery S. Feit, Ariel Feiglin, Yingtian Xie, Nikolas Kesten, Len Taing, Joseph Perkins, Shengqing Gu, Yihao Li, Paloma Cejas, Ningxuan Zhou, Rinath Jeselsohn, Myles Brown, X. Shirley Liu, Henry W. Long

Chromatin immunoprecipitation sequencing (ChIP-seq) and the Assay for Transposase-Accessible Chromatin with high-throughput sequencing (ATAC-seq) have become essential technologies to effectively measure protein–DNA interactions and chromatin accessibility. However, there is a need for a scalable and reproducible pipeline that incorporates proper normalization between samples, correction of copy number variations, and integration of new downstream analysis tools. Here we present Containerized Bioinformatics workflow for Reproducible ChIP/ATAC-seq Analysis (CoBRA), a modularized computational workflow which quantifies ChIP-seq and ATAC-seq peak regions and performs unsupervised and supervised analyses. CoBRA provides a comprehensive state-of-the-art ChIP-seq and ATAC-seq analysis pipeline that can be used by scientists with limited computational experience. This enables researchers to gain rapid insight into protein–DNA interactions and chromatin accessibility through sample clustering, differential peak calling, motif enrichment, comparison of sites to a reference database, and pathway analysis. CoBRA is publicly available online at https://bitbucket.org/cfce/cobra.
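Chromatin immunoprecipitation sequencing (ChIP-seq) and the chromatin accessibility assay ATAC-seq have become essential technologies for effectively measuring protein–DNA interactions and chromatin accessibility. However, there is a need for a pipeline that normalizes samples accurately, corrects for copy number variations, and integrates better downstream analyses. Here we present CoBRA, a container-based bioinformatics pipeline for ChIP/ATAC-seq analysis. It is a modular analysis pipeline that performs supervised and unsupervised quantitative analyses of ChIP/ATAC-seq data. CoBRA offers researchers with limited computational experience a comprehensive, state-of-the-art ChIP/ATAC-seq analysis. Through clustering, differential peak analysis, motif enrichment analysis, comparison with existing databases, and gene function analysis, the pipeline lets researchers quickly analyze protein–DNA interactions and chromatin accessibility. CoBRA website: https://bitbucket.org/cfce/cobra.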

Page 652-661


Application Note

CVTree: A Parallel Alignment-free Phylogeny and Taxonomy Tool Based on Composition Vectors of Genomes

Guanghong Zuo

Composition Vector Tree (CVTree) is an alignment-free algorithm to infer phylogenetic relationships from genome sequences. It has been successfully applied to study the phylogeny and taxonomy of viruses, prokaryotes, and fungi based on whole genomes, as well as chloroplast genomes, mitochondrial genomes, and metagenomes. Here we present the standalone software for the CVTree algorithm. In the software, an extensible parallel workflow for the CVTree algorithm was designed, and new alignment-free methods were implemented on top of this workflow. By examining the phylogeny and taxonomy of 13,903 prokaryotes based on 16S rRNA sequences, we show that the CVTree software is an efficient and effective tool for studying phylogeny and taxonomy based on genome sequences. The code of the CVTree software is available at https://github.com/ghzuo/cvtree.

Page 662-667
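
The composition-vector idea behind CVTree can be sketched in a few lines of Python. The sketch assumes the formulation described in earlier CVTree publications: observed k-mer frequencies are compared with a Markov-model prediction built from (k-1)- and (k-2)-mer frequencies, and two sequences are compared by the cosine of their composition vectors. The function names, the toy sequences, and the restriction to k-mers actually observed in the sequence are simplifications made here; this is not the code of the CVTree software.

```python
from collections import Counter
from math import sqrt


def kmer_freq(seq, k):
    """Relative frequency of every k-mer observed in seq."""
    n = len(seq) - k + 1
    if n <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(n))
    return {word: c / n for word, c in counts.items()}


def composition_vector(seq, k):
    """Relative deviation of observed k-mer frequencies from a
    Markov-model prediction (assumed formulation; requires k >= 3).

    Simplification: only k-mers that occur in seq are scored.
    """
    if k < 3:
        raise ValueError("k must be at least 3")
    fk = kmer_freq(seq, k)
    fk1 = kmer_freq(seq, k - 1)
    fk2 = kmer_freq(seq, k - 2)
    vector = {}
    for word, f in fk.items():
        middle = fk2.get(word[1:-1], 0.0)
        # Predicted frequency: f(prefix) * f(suffix) / f(middle).
        f0 = fk1[word[:-1]] * fk1[word[1:]] / middle if middle else 0.0
        vector[word] = (f - f0) / f0 if f0 else 0.0
    return vector


def cv_distance(a, b):
    """Distance (1 - cosine) / 2 between two composition vectors."""
    dot = sum(v * b.get(word, 0.0) for word, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.5  # undefined cosine: treat the vectors as uncorrelated
    return (1 - dot / (norm_a * norm_b)) / 2


va = composition_vector("ACGTACGTTGCAAC", 3)
vb = composition_vector("ACGTTCGTAGCATC", 3)
d_ab = cv_distance(va, vb)
```

The resulting pairwise distances lie between 0 and 1 and could be passed to any distance-based tree builder, such as neighbor joining, to obtain a phylogeny.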