1. From Reads to Insights: Integrative Pipelines for Biological Interpretation of ATAC-seq Data
Ya Cui, Jason Sheng Li, Wei Li
2. SPA: A Quantitation Strategy for MS Data in Patient-derived Xenograft Models
Xi Cheng, Lili Qian, Bo Wang, Minjia Tan, Jing Li
With the development of mass spectrometry (MS)-based proteomics technologies, patient-derived xenograft (PDX) models, which are generated from a patient's primary tumor, are widely used for proteome-wide analysis of cancer mechanisms and for identification of drug biomarkers. However, interpreting the proteomics data remains challenging because a PDX sample is a cross-species mixture of human cancerous tissue and immunodeficient mouse tissue, which makes data deconvolution complex. In this study, using lab-assembled mixtures of human and mouse cells at different mixing ratios as a benchmark, we developed and evaluated a new method, SPA (shared peptide allocation), for protein quantitation that considers both the unique and the shared peptides of the two species. The results showed that SPA could provide more convenient and accurate protein quantitation in human–mouse mixed samples. Further validation on a pair of gastric PDX samples (one bearing FGFR2 amplification and the other not) showed that our new method not only significantly improved the overall protein identification, but also exclusively detected the differential phosphorylation of FGFR2 and its downstream mediators (such as RAS and ERK). The tool pdxSPA is freely available at https://github.com/Li-Lab-Proteomics/pdxSPA.
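The abstract does not spell out the allocation rule, so the following is only a minimal sketch of one plausible way to allocate shared peptides. It assumes that the intensity of a peptide shared by a human and a mouse protein is split in proportion to the summed intensity of each protein's unique peptides, with an equal split when neither protein has unique evidence. The function name, the input layout, and the proportional rule are illustrative assumptions, not the pdxSPA implementation.

```python
from collections import defaultdict


def allocate_shared_peptides(peptides):
    """Return per-protein intensity totals.

    `peptides` is a list of (intensity, [protein_ids]) pairs. A peptide
    listing one protein is "unique"; one listing several is "shared".
    ASSUMPTION: shared intensity is split in proportion to the unique
    intensity of each candidate protein (hypothetical rule).
    """
    # Pass 1: sum the intensity of unique peptides per protein.
    unique_totals = defaultdict(float)
    for intensity, proteins in peptides:
        if len(proteins) == 1:
            unique_totals[proteins[0]] += intensity

    # Pass 2: distribute each shared peptide across its proteins.
    totals = defaultdict(float, unique_totals)
    for intensity, proteins in peptides:
        if len(proteins) < 2:
            continue
        weights = [unique_totals.get(p, 0.0) for p in proteins]
        weight_sum = sum(weights)
        for protein, weight in zip(proteins, weights):
            # Fall back to an equal split when no unique evidence exists.
            share = weight / weight_sum if weight_sum > 0 else 1.0 / len(proteins)
            totals[protein] += intensity * share
    return dict(totals)


# Hypothetical example: one unique peptide per species plus one shared peptide.
example = [
    (100.0, ["HUMAN_FGFR2"]),
    (300.0, ["MOUSE_Fgfr2"]),
    (80.0, ["HUMAN_FGFR2", "MOUSE_Fgfr2"]),
]
protein_totals = allocate_shared_peptides(example)
```

In this toy input the shared peptide is divided 1:3 between the human and mouse proteins, mirroring the ratio of their unique intensities. Discarding shared peptides altogether would instead drop that signal from both species.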
3. RePhine: An Integrative Method for Identification of Drug Response-related Transcriptional Regulators
Xujun Wang, Zhengtao Zhang, Wenyi Qin, Shiyi Liu, Cong Liu, Georgi Z. Genchev, Lijian Hui, Hongyu Zhao, Hui Lu
Transcriptional regulators (TRs) participate in essential processes in cancer pathogenesis and are critical therapeutic targets. Identification of drug response-related TRs from cell line-based compound screening data is often challenging due to low mRNA abundance of TRs, protein modifications, and other confounders (CFs). In this study, we developed a regression-based pharmacogenomic and ChIP-seq data integration method (RePhine) to infer the impact of TRs on drug response through integrative analyses of pharmacogenomic and ChIP-seq data. RePhine was evaluated in simulation and pharmacogenomic data and was applied to pan-cancer datasets with the goal of biological discovery. In simulation data with added noises or CFs and in pharmacogenomic data, RePhine demonstrated an improved performance in comparison with three commonly used methods (including Pearson correlation analysis, logistic regression model, and gene set enrichment analysis). Utilizing RePhine and Cancer Cell Line Encyclopedia data, we observed that RePhine-derived TR signatures could effectively cluster drugs with different mechanisms of action. RePhine predicted that loss-of-function of EZH2/PRC2 reduces cancer cell sensitivity toward the BRAF inhibitor PLX4720. Experimental validation confirmed that pharmacological EZH2 inhibition increases the resistance of cancer cells to PLX4720 treatment. Our results support that RePhine is a useful tool for inferring drug response-related TRs and for potential therapeutic applications. The source code for RePhine is freely available at https://github.com/coexps/RePhine.
4. NOGEA: A Network-oriented Gene Entropy Approach for Dissecting Disease Comorbidity and Drug Repositioning
Zihu Guo, Yingxue Fu, Chao Huang, Chunli Zheng, Ziyin Wu, Xuetong Chen, Shuo Gao, Yaohua Ma, Mohamed Shahen, Yan Li, Pengfei Tu, Jingbo Zhu, Zhenzhong Wang, Wei Xiao, Yonghua Wang
Rapid development of high-throughput technologies has permitted the identification of an increasing number of disease-associated genes (DAGs), which are important for understanding disease initiation and developing precision therapeutics. However, DAGs often contain large amounts of redundant or false positive information, leading to difficulties in quantifying and prioritizing potential relationships between these DAGs and human diseases. In this study, a network-oriented gene entropy approach (NOGEA) is proposed for accurately inferring master genes that contribute to specific diseases by quantitatively calculating their perturbation abilities on directed disease-specific gene networks. In addition, we confirmed that the master genes identified by NOGEA have a high reliability for predicting disease-specific initiation events and progression risk. Master genes may also be used to extract the underlying information of different diseases, thus revealing mechanisms of disease comorbidity. More importantly, approved therapeutic targets are topologically localized in a small neighborhood of master genes in the interactome network, which provides a new way for predicting drug-disease associations. Through this method, 11 old drugs were newly identified and predicted to be effective for treating pancreatic cancer and then validated by in vitro experiments. Collectively, the NOGEA was useful for identifying master genes that control disease initiation and co-occurrence, thus providing a valuable strategy for drug efficacy screening and repositioning. NOGEA codes are publicly available at https://github.com/guozihuaa/NOGEA.
5. DeepCAPE: A Deep Convolutional Neural Network for the Accurate Prediction of Enhancers
Shengquan Chen, Mingxin Gan, Hairong Lv, Rui Jiang
The establishment of a landscape of enhancers across human cells is crucial to deciphering the mechanism of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, which have successfully reported enhancers in typical cell lines, are still too costly and time-consuming to perform systematic identification of enhancers specific to different cell lines. Existing computational methods, capable of predicting regulatory elements purely relying on DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that chromatin accessibility of a DNA segment is closely related to its potential function in regulation, and thus may provide useful information in identifying regulatory elements. Motivated by this understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We propose DeepCAPE, a deep convolutional neural network to predict enhancers via the integration of DNA sequences and DNase-seq data. Benefiting from the well-designed feature extraction mechanism and skip connection strategy, our model not only consistently outperforms existing methods in the imbalanced classification of cell line-specific enhancers against background sequences, but also has the ability to self-adapt to different sizes of datasets. In addition, with the adoption of an auto-encoder, our model is capable of making cross-cell line predictions. We further visualize kernels of the first convolutional layer and show the match of identified sequence signatures and known motifs. We finally demonstrate the potential ability of our model to explain functional implications of putative disease-associated genetic variants and discriminate disease-related enhancers. The source code and detailed tutorial of DeepCAPE are freely available at https://github.com/ShengquanChen/DeepCAPE.
6. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types
Tingting Chen, Xu Chen, Sisi Zhang, Junwei Zhu, Bixia Tang, Anke Wang, Lili Dong, Zhewen Zhang, Caixia Yu, Yanling Sun, Lianjiang Chi, Huanxin Chen, Shuang Zhai, Yubin Sun, Li Lan, Xin Zhang, Jingfa Xiao, Yiming Bao, Yanqing Wang, Zhang Zhang, Wenming Zhao
The Genome Sequence Archive (GSA) is a data repository for archiving raw sequence data, which provides data storage and sharing services for worldwide scientific communities. Considering explosive data growth with diverse data types, here we present the GSA family by expanding into a set of resources for raw data archive with different purposes, namely, GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the 2017 version, GSA has been significantly updated in data model, online functionalities, and web interfaces. GSA-Human, as a new partner of GSA, is a data repository specialized in human genetics-related data with controlled access and security. OMIX, as a critical complement to the two resources mentioned above, is an open archive for miscellaneous data. Together, all these resources form a family of resources dedicated to archiving explosive data with diverse types, accepting data submissions from all over the world, and providing free open access to all publicly available data in support of worldwide research activities.
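The Genome Sequence Archive (GSA), a repository for archiving raw omics data, is a public-interest database for managing raw sequencing data in the life sciences that aims to promote the sharing and use of omics data worldwide. In recent years, in response to the explosive growth of omics data, the diversification of data types, and the special requirements for managing human genetic resource data, we have updated and expanded GSA into a family of data resources (GSA Family) comprising GSA (https://ngdc.cncb.ac.cn/gsa/), GSA for Human (GSA-Human, https://ngdc.cncb.ac.cn/gsa-human/), and Open Archive for Miscellaneous Data (OMIX, https://ngdc.cncb.ac.cn/omix/). Compared with the version released in 2017, GSA has been updated in its data model, system functionalities, and data submission methods. GSA-Human is a database dedicated to storing human genetic resource data; it provides controlled access to these data and safeguards their security. OMIX is an archive for data other than raw sequencing data, such as environmental omics, phenomics, and metabolomics data; as an important complement to the two resources above, it meets users' need to store data types other than raw sequencing data. The GSA Family resources are dedicated to collecting and managing data of various types, accept submissions from researchers all over the world, and provide free open access to all publicly available data in support of life science research worldwide.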
7. Genome Warehouse: A Public Repository Housing Genome-scale Data
Meili Chen, Yingke Ma, Song Wu, Xinchang Zheng, Hongen Kang, Jian Sang, Xingjian Xu, Lili Hao, Zhaohua Li, Zheng Gong, Jingfa Xiao, Zhang Zhang, Wenming Zhao, Yiming Bao
The Genome Warehouse (GWH) is a public repository housing genome assembly data for a wide range of species and delivering a series of web services for genome data submission, storage, release, and sharing. As one of the core resources in the National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB; https://ngdc.cncb.ac.cn), GWH accepts both full and partial (chloroplast, mitochondrion, and plasmid) genome sequences with different assembly levels, as well as updates of existing genome assemblies. For each assembly, GWH collects detailed genome-related metadata of biological project, biological sample, and genome assembly, in addition to genome sequence and annotation. To archive high-quality genome sequences and annotations, GWH is equipped with a uniform and standardized procedure for quality control. Besides basic browse and search functionalities, all released genome sequences and annotations can be visualized with JBrowse. As of May 21, 2021, GWH had received 19,124 direct submissions covering a diversity of 1108 species and had released 8772 of them. Collectively, GWH serves as an important resource for genome-scale data management and provides free and publicly accessible data to support research activities throughout the world. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.
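The Genome Warehouse (GWH) is a publicly accessible repository that stores genome assembly data for many species and provides a series of web services for genome data submission, storage, release, and sharing. As a core resource of the China National Center for Bioinformation (CNCB) and the National Genomics Data Center (NGDC), GWH accepts submissions of complete and partial (chloroplast, mitochondrial, and plasmid) genome sequences at different assembly levels, as well as updates to existing genome assemblies. For each assembly, in addition to the genome sequence and annotation, GWH collects detailed genome-related metadata covering the biological project, biological sample, and genome assembly. GWH is equipped with a uniform, standardized quality control procedure for archiving high-quality sequences and annotations. Besides basic functions such as browsing and searching, released genome sequences and annotations can be visualized with JBrowse. As of May 21, 2021, GWH had received 19,124 genome assemblies submitted directly by users, covering 1108 species, and had released 8772 of them. In summary, GWH is an important resource for managing large-scale genome data and provides free, publicly accessible genome data to researchers worldwide. GWH is publicly accessible at https://ngdc.cncb.ac.cn/gwh.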
8. REVA as A Well-curated Database for Human Expression-modulating Variants
Yu Wang, Fang-Yuan Shi, Yu Liang, Ge Gao
More than 90% of disease- and trait-associated human variants are noncoding. By systematically screening multiple large-scale studies, we compiled REVA, a manually curated database for over 11.8 million experimentally tested noncoding variants with expression-modulating potentials. We provided 2424 functional annotations that could be used to pinpoint the plausible regulatory mechanism of these variants. We further benchmarked multiple state-of-the-art computational tools and found that their limited sensitivity remains a serious challenge for effective large-scale analysis. REVA provides high-quality experimentally tested expression-modulating variants with extensive functional annotations, which will be useful for users in the noncoding variant community. REVA is freely available at http://reva.gao-lab.org.
9. SmProt: A Reliable Repository with Comprehensive Annotation of Small Proteins Identified from Ribosome Profiling
Yanyan Li, Honghong Zhou, Xiaomin Chen, Yu Zheng, Quan Kang, Di Hao, Lili Zhang, Tingrui Song, Huaxia Luo, Yajing Hao, Runsheng Chen, Peng Zhang, Shunmin He
Small proteins specifically refer to proteins consisting of fewer than 100 amino acids translated from small open reading frames (sORFs), which were usually missed in previous genome annotation. The significance of small proteins has been revealed in recent years, along with the discovery of their diverse functions. However, systematic annotation of small proteins is still insufficient. SmProt was specially developed to provide valuable information on small proteins for the scientific community. Here we present the update of SmProt, which emphasizes reliability of translated sORFs, genetic variants in translated sORFs, disease-specific sORF translation events or sequences, and remarkably increased data volume. More components such as non-ATG translation initiation, function, and new sources are also included. SmProt incorporates 638,958 unique small proteins curated from 3,165,229 primary records, which were computationally predicted from 419 ribosome profiling (Ribo-seq) datasets or collected from literature and other sources from 370 cell lines or tissues in 8 species (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, and Escherichia coli). In addition, small protein families identified from human microbiomes were also collected. All datasets in SmProt are free to access and available for browsing, searching, and bulk download at http://bigdata.ibp.ac.cn/SmProt/.
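Construction of the Small Protein Repository (SmProt).
Small open reading frames (sORFs) are widespread in the genomes of humans and many other organisms. They can be located in coding regions, in the untranslated regions (UTRs) of mRNAs, and on various non-coding RNAs (ncRNAs), and some of them can be translated into small proteins. Studies have found that small proteins function in a variety of biological processes and are associated with many diseases, yet they were usually overlooked in previous genome annotation. SmProt is dedicated to integrating small protein information from multiple sources, especially small proteins translated from lncRNAs and UTRs.
Based on extensive collection, strict quality control, and reanalysis of public ribosome profiling (Ribo-seq) data, mining of published literature and databases, cross-integration of information from multiple sources, merging and de-duplication of results, and restructuring of the data framework, the authors have released a completely new SmProt that provides more systematic, richer, and more accurate small protein annotation. All related information supports efficient online browsing, searching, visualization, BLAST, and download.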
10. OGP: A Repository of Experimentally Characterized O-glycoproteins to Facilitate Studies on O-glycosylation
Jiangming Huang, Mengxi Wu, Yang Zhang, Siyuan Kong, Mingqi Liu, Biyun Jiang, Pengyuan Yang, Weiqian Cao
Numerous studies on cancers, biopharmaceuticals, and clinical trials have necessitated comprehensive and precise analysis of protein O-glycosylation. However, the lack of updated and convenient databases deters the storage of and reference to emerging O-glycoprotein data. To resolve this issue, an O-glycoprotein repository named OGP was established in this work. It was constructed with a collection of O-glycoprotein data from different sources. OGP contains 9354 O-glycosylation sites and 11,633 site-specific O-glycans mapping to 2133 O-glycoproteins, and it is the largest O-glycoprotein repository thus far. Based on the recorded O-glycosylation sites, an O-glycosylation site prediction tool was developed. Moreover, an OGP-based website is already available (https://www.oglyp.org/). The website comprises four specially designed and user-friendly modules: statistical analysis, database search, site prediction, and data submission. The first version of OGP repository and the website allow users to obtain various O-glycoprotein-related information, such as protein accession Nos., O-glycosylation sites, O-glycopeptide sequences, site-specific O-glycan structures, experimental methods, and potential O-glycosylation sites. O-glycosylation data mining can be performed efficiently on this website, which will greatly facilitate related studies. In addition, the database is accessible from OGP website (https://www.oglyp.org/download.php).
11. rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study
Lilin Yin, Haohao Zhang, Zhenshuang Tang, Jingya Xu, Dong Yin, Zhiwu Zhang, Xiaohui Yuan, Mengjin Zhu, Shuhong Zhao, Xinyun Li, Xiaolei Liu
Along with the development of high-throughput sequencing technologies, both sample size and SNP number are increasing rapidly in genome-wide association studies (GWAS), and the associated computation is more challenging than ever. Here, we present a memory-efficient, visualization-enhanced, and parallel-accelerated R package called “rMVP” to address the need for improved GWAS computation. rMVP can 1) effectively process large GWAS data, 2) rapidly evaluate population structure, 3) efficiently estimate variance components by Efficient Mixed-Model Association eXpedited (EMMAX), Factored Spectrally Transformed Linear Mixed Models (FaST-LMM), and Haseman-Elston (HE) regression algorithms, 4) implement parallel-accelerated association tests of markers using general linear model (GLM), mixed linear model (MLM), and fixed and random model circulating probability unification (FarmCPU) methods, 5) compute fast with a globally efficient design in the GWAS processes, and 6) generate various visualizations of GWAS-related information. Accelerated by block matrix multiplication strategy and multiple threads, the association test methods embedded in rMVP are significantly faster than PLINK, GEMMA, and FarmCPU_pkg. rMVP is freely available at https://github.com/xiaolei-lab/rMVP.
12. GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction
Jiabo Wang, Zhiwu Zhang
Genome-wide association study (GWAS) and genomic prediction/selection (GP/GS) are the two essential enterprises in genomic research. Due to the great magnitude and complexity of genomic and phenotypic data, analytical methods and their associated software packages are frequently advanced. GAPIT is a widely-used genomic association and prediction integrated tool as an R package. The first version was released to the public in 2012 with the implementation of the general linear model (GLM), mixed linear model (MLM), compressed MLM (CMLM), and genomic best linear unbiased prediction (gBLUP). The second version was released in 2016 with several new implementations, including enriched CMLM (ECMLM) and settlement of MLMs under progressively exclusive relationship (SUPER). All the GWAS methods are based on the single-locus test. For the first time, in the current release of GAPIT, version 3 implemented three multi-locus test methods, including multiple loci mixed model (MLMM), fixed and random model circulating probability unification (FarmCPU), and Bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK). Additionally, two GP/GS methods were implemented based on CMLM (named compressed BLUP; cBLUP) and SUPER (named SUPER BLUP; sBLUP). These new implementations not only boost statistical power for GWAS and prediction accuracy for GP/GS, but also improve computing speed and increase the capacity to analyze big genomic data. Here, we document the current upgrade of GAPIT by describing the selection of the recently developed methods, their implementations, and potential impact. All documents, including source code, user manual, demo data, and tutorials, are freely available at the GAPIT website (http://zzlab.net/GAPIT).
Construction of GAPIT version 3, a software package for genome-wide association analysis and prediction.
13. AIAP: A Quality Control and Integrative Analysis Package to Improve ATAC-seq Data Analysis
Shaopeng Liu, Daofeng Li, Cheng Lyu, Paul M. Gontarz, Benpeng Miao, Pamela A.F. Madden, Ting Wang, Bo Zhang
Assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) is a technique widely used to investigate genome-wide chromatin accessibility. The recently published Omni-ATAC-seq protocol substantially improves the signal/noise ratio and reduces the input cell number. High-quality data are critical to ensure accurate analysis. Several tools have been developed for assessing sequencing quality and insertion size distribution for ATAC-seq data; however, key quality control (QC) metrics have not yet been established to accurately determine the quality of ATAC-seq data. Here, we optimized the analysis strategy for ATAC-seq and defined a series of QC metrics for ATAC-seq data, including reads under peak ratio (RUPr), background (BG), promoter enrichment (ProEn), subsampling enrichment (SubEn), and other measurements. We incorporated these QC tests into our recently developed ATAC-seq Integrative Analysis Package (AIAP) to provide a complete ATAC-seq analysis system, including quality assurance, improved peak calling, and downstream differential analysis. We demonstrated a significant improvement of sensitivity (20%–60%) in both peak calling and differential analysis by processing paired-end ATAC-seq datasets using AIAP. AIAP is compiled into Docker/Singularity, and it can be executed by one command line to generate a comprehensive QC report. We used ENCODE ATAC-seq data to benchmark and generate QC recommendations, and developed qATACViewer for the user-friendly interaction with the QC report. The software, source code, and documentation of AIAP are freely available at https://github.com/Zhang-lab/ATAC-seq_QC_analysis.
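ATAC-seq is a high-throughput sequencing technique that uses the Tn5 transposase to study chromatin accessibility. As the technique has developed, it has gradually become the mainstream method for studying chromatin accessibility, making ATAC-seq data analysis software particularly necessary; however, existing ATAC-seq analysis tools cannot analyze ATAC-seq data systematically. To meet this need, we optimized commonly used ATAC-seq analysis strategies and, in combination with benchmarking, established a series of metrics for assessing ATAC-seq data quality, including reads under peak ratio (RUPr), background (BG), promoter enrichment (ProEn), and subsampling enrichment (SubEn). We integrated these metrics and the analysis workflow with existing tools and algorithms (such as BWA, MethylQA, MACS2, and DESeq2) to develop AIAP, a software package capable of systematic analysis of ATAC-seq data. AIAP is compiled into Docker/Singularity images; with a single command line, it can analyze any ATAC-seq dataset and generate a comprehensive quality assessment report. Further testing showed that AIAP improved the sensitivity of ATAC-seq peak calling and of differentially accessible region detection by 20%–60%. We used AIAP to analyze ENCODE ATAC-seq data and thereby established reference standards for assessing ATAC-seq data quality. All data quality reports and reference standards can be compared visually using qATACViewer, a tool we developed. Finally, the AIAP code is open source on GitHub.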
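To make the reads under peak ratio (RUPr) metric concrete, here is a minimal sketch that computes the fraction of reads overlapping at least one called peak. It assumes half-open `(chrom, start, end)` intervals and counts any overlap of one base or more; the function names and these conventions are illustrative assumptions, and the exact definition used inside AIAP may differ.

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def reads_under_peak_ratio(reads, peaks):
    """Fraction of reads that overlap at least one peak.

    `reads` and `peaks` are lists of (chrom, start, end) tuples with
    half-open coordinates (ASSUMPTION: BED-like convention).
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}

    under = 0
    for chrom, start, end in reads:
        if chrom not in merged:
            continue
        # Index of the last merged peak that starts before the read ends.
        i = bisect_right(starts[chrom], end - 1) - 1
        # Merged peaks are disjoint, so only this peak can overlap the read.
        if i >= 0 and merged[chrom][i][1] > start:
            under += 1
    return under / len(reads) if reads else 0.0
```

Merging the peaks first keeps each lookup to a single binary search, which matters when millions of reads are checked against tens of thousands of peaks.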
14. CoBRA: Containerized Bioinformatics Workflow for Reproducible ChIP/ATAC-seq Analysis
Xintao Qiu, Avery S. Feit, Ariel Feiglin, Yingtian Xie, Nikolas Kesten, Len Taing, Joseph Perkins, Shengqing Gu, Yihao Li, Paloma Cejas, Ningxuan Zhou, Rinath Jeselsohn, Myles Brown, X. Shirley Liu, Henry W. Long
Chromatin immunoprecipitation sequencing (ChIP-seq) and the Assay for Transposase-Accessible Chromatin with high-throughput sequencing (ATAC-seq) have become essential technologies to effectively measure protein–DNA interactions and chromatin accessibility. However, there is a need for a scalable and reproducible pipeline that incorporates proper normalization between samples, correction of copy number variations, and integration of new downstream analysis tools. Here we present Containerized Bioinformatics workflow for Reproducible ChIP/ATAC-seq Analysis (CoBRA), a modularized computational workflow which quantifies ChIP-seq and ATAC-seq peak regions and performs unsupervised and supervised analyses. CoBRA provides a comprehensive state-of-the-art ChIP-seq and ATAC-seq analysis pipeline that can be used by scientists with limited computational experience. This enables researchers to gain rapid insight into protein–DNA interactions and chromatin accessibility through sample clustering, differential peak calling, motif enrichment, comparison of sites to a reference database, and pathway analysis. CoBRA is publicly available online at https://bitbucket.org/cfce/cobra
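Chromatin immunoprecipitation sequencing (ChIP-seq) and the assay for chromatin accessibility (ATAC-seq) have become essential technologies for effectively measuring protein–DNA interactions and chromatin accessibility. However, there is a need for a pipeline that normalizes samples accurately, corrects for copy number variations, and integrates better downstream analyses. Here we present CoBRA, a container-based bioinformatics pipeline for ChIP/ATAC-seq analysis. It is a modular pipeline that performs supervised and unsupervised quantitative analyses of ChIP/ATAC-seq data. CoBRA provides comprehensive, state-of-the-art ChIP/ATAC-seq analysis for researchers with limited computational experience. Through clustering, differential peak analysis, motif enrichment analysis, comparison with existing databases, and gene function analysis, the pipeline lets researchers rapidly analyze protein–DNA interactions and chromatin accessibility. CoBRA website: https://bitbucket.org/cfce/cobra.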
15. CVTree: A Parallel Alignment-free Phylogeny and Taxonomy Tool Based on Composition Vectors of Genomes
Composition Vector Tree (CVTree) is an alignment-free algorithm to infer phylogenetic relationships from genome sequences. It has been successfully applied to study the phylogeny and taxonomy of viruses, prokaryotes, and fungi based on whole genomes, as well as chloroplast genomes, mitochondrial genomes, and metagenomes. Here we present the standalone software for the CVTree algorithm. In the software, an extensible parallel workflow for the CVTree algorithm was designed. Based on the workflow, new alignment-free methods were also implemented. By examining the phylogeny and taxonomy of 13,903 prokaryotes based on 16S rRNA sequences, we show that the CVTree software is an efficient and effective tool for studying phylogeny and taxonomy based on genome sequences. The code of the CVTree software is available at https://github.com/ghzuo/cvtree.
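The general idea of alignment-free comparison with composition vectors can be sketched as follows. This is a deliberately simplified illustration: each sequence is reduced to a vector of k-mer frequencies, and two sequences are compared by a cosine-based distance. It omits the background-correction step of the published CVTree method, and the function names, the default `k`, and the `(1 - cosine) / 2` distance form are assumptions made for this sketch rather than details taken from the CVTree software.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k):
    """Return a sparse vector {k-mer: relative frequency} for `seq`."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def composition_distance(seq_a, seq_b, k=3):
    """Cosine-based distance between k-mer vectors, in the range [0, 1].

    SIMPLIFICATION: raw k-mer frequencies are used directly; no background
    model is subtracted before the comparison.
    """
    vec_a, vec_b = kmer_vector(seq_a, k), kmer_vector(seq_b, k)
    dot = sum(f * vec_b.get(kmer, 0.0) for kmer, f in vec_a.items())
    norm = sqrt(sum(f * f for f in vec_a.values())) * sqrt(
        sum(f * f for f in vec_b.values())
    )
    if norm == 0:
        raise ValueError("both sequences must contain at least one k-mer")
    return (1.0 - dot / norm) / 2.0
```

A matrix of such pairwise distances could then be passed to a standard tree-building method such as neighbor joining. Because no alignment is required, each pair of sequences can be processed independently, which is what makes this family of methods easy to parallelize.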