1. Artificial Intelligence in Omics
Feng Gao, Kun Huang, Yi Xing
2. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review
Matthew Brende, Chang Su, Zilong Bai, Hao Zhang, Olivier Elemento, Fei Wang
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
3. Computational Methods for Single-cell Multi-omics Integration and Alignment
Stefan Stanojevic, Yijun Li, Aleksandar Ristivojevic, Lana X. Garmire
Recently developed technologies to generate single-cell genomic data have made a revolutionary impact in the field of biology. Multi-omics assays offer even greater opportunities to understand cellular states and biological processes. The problem of integrating different omics data with very different dimensionality and statistical properties remains, however, quite challenging. A growing body of computational tools is being developed for this task, leveraging ideas ranging from machine translation to the theory of networks, and represents another frontier on the interface of biology and data science. Our goal in this review is to provide a comprehensive, up-to-date survey of computational techniques for the integration of single-cell multi-omics data, while making the concepts behind each algorithm approachable to a non-expert audience.
4. Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis
Yawei Li, Xin Wu, Ping Yang, Guoqian Jiang, Yuan Luo
The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.
5. Microbial Dark Matter: from Discovery to Applications
Yuguo Zha, Hui Chong, Pengshuo Yang, Kang Ning
With the rapid increase of the microbiome samples and sequencing data, more and more knowledge about microbial communities has been gained. However, there is still much more to learn about microbial communities, including billions of novel species and genes, as well as countless spatiotemporal dynamic patterns within the microbial communities, which together form the microbial dark matter. In this work, we summarized the dark matter in microbiome research and reviewed current data mining methods, especially artificial intelligence (AI) methods, for different types of knowledge discovery from microbial dark matter. We also provided case studies on using AI methods for microbiome data mining and knowledge discovery. In summary, we view microbial dark matter not as a problem to be solved but as an opportunity for AI methods to explore, with the goal of advancing our understanding of microbial communities, as well as developing better solutions to global concerns about human health and the environment.
6. Machine Learning Modeling of Protein-intrinsic Features Predicts Tractability of Targeted Protein Degradation
Wubing Zhang, Shourya S. Roy Burman, Jiaye Chen, Katherine A. Donovan, Yang Cao, Chelsea Shu, Boning Zhang, Zexian Zeng, Shengqing Gu, Yi Zhang, Dian Li, Eric S. Fischer, Collin Tokheim, X. Shirley Liu
Targeted protein degradation (TPD) has rapidly emerged as a therapeutic modality to eliminate previously undruggable proteins by repurposing the cell’s endogenous protein degradation machinery. However, the susceptibility of proteins for targeting by TPD approaches, termed “degradability”, is largely unknown. Here, we developed a machine learning model, model-free analysis of protein degradability (MAPD), to predict degradability from features intrinsic to protein targets. MAPD shows accurate performance in predicting kinases that are degradable by TPD compounds [with an area under the precision–recall curve (AUPRC) of 0.759 and an area under the receiver operating characteristic curve (AUROC) of 0.775] and is likely generalizable to independent non-kinase proteins. We found five features with statistical significance to achieve optimal prediction, with ubiquitination potential being the most predictive. By structural modeling, we found that E2-accessible ubiquitination sites, but not lysine residues in general, are particularly associated with kinase degradability. Finally, we extended MAPD predictions to the entire proteome to find 964 disease-causing proteins (including proteins encoded by 278 cancer genes) that may be tractable to TPD drug development.
7. Assessment and Optimization of Explainable Machine Learning Models Applied to Transcriptomic Data
Yongbing Zhao, Jinfeng Shao, Yan W. Asmann
Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, understanding of the applicability of these model explainers to biological data is still lacking. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models for predicting tissue types from transcriptomic data and by identifying the top contributing genes from each sample with the greatest impacts on model prediction. To improve the reproducibility and interpretability of results generated by model explainers, we proposed a series of optimization strategies for each explainer on two different model architectures of multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
8. SOPHIE: Generative Neural Networks Separate Common and Specific Transcriptional Responses
Alexandra J. Lee, Dallas L. Mould, Jake Crawford, Dongbo Hu, Rani K. Powers, Georgia Doing, James C. Costello, Deborah A. Hogan, Casey S. Greene
Genome-wide transcriptome profiling identifies genes that are prone to differential expression (DE) across contexts, as well as genes with changes specific to the experimental manipulation. Distinguishing genes that are specifically changed in a context of interest from common differentially expressed genes (DEGs) allows more efficient prediction of which genes are specific to a given biological process under scrutiny. Currently, common DEGs or pathways can only be identified through the laborious manual curation of experiments, an inordinately time-consuming endeavor. Here we pioneer an approach, Specific cOntext Pattern Highlighting In Expression data (SOPHIE), for distinguishing between common and specific transcriptional patterns using a generative neural network to create a background set of experiments from which a null distribution of gene and pathway changes can be generated. We apply SOPHIE to diverse datasets including those from human, human cancer, and bacterial pathogen Pseudomonas aeruginosa. SOPHIE identifies common DEGs in concordance with previously described, manually and systematically determined common DEGs. Further molecular validation indicates that SOPHIE detects highly specific but low-magnitude biologically relevant transcriptional changes. SOPHIE’s measure of specificity can complement log2 fold change values generated from traditional DE analyses. For example, by filtering the set of DEGs, one can identify genes that are specifically relevant to the experimental condition of interest. Consequently, these results can inform future research directions. All scripts used in these analyses are available at https://github.com/greenelab/generic-expression-patterns. Users can access https://github.com/greenelab/sophie to run SOPHIE on their own data.
9. DGMP: Identifying Cancer Driver Genes by Jointing DGCN and MLP from Multi-omics Genomic Data
Shao-Wu Zhang, Jing-Yu Xu, Tong Zhang
Identification of cancer driver genes plays an important role in precision oncology research, which is helpful to understand cancer initiation and progression. However, most existing computational methods mainly used the protein–protein interaction (PPI) networks, or treated the directed gene regulatory networks (GRNs) as the undirected gene–gene association networks to identify the cancer driver genes, which will lose the unique structure regulatory information in the directed GRNs, and then affect the outcome of the cancer driver gene identification. Here, based on the multi-omics pan-cancer data (i.e., gene expression, mutation, copy number variation, and DNA methylation), we propose a novel method (called DGMP) to identify cancer driver genes by jointing directed graph convolutional network (DGCN) and multilayer perceptron (MLP). DGMP learns the multi-omics features of genes as well as the topological structure features in GRN with the DGCN model and uses MLP to weigh more on gene features for mitigating the bias toward the graph topological features in the DGCN learning process. The results on three GRNs show that DGMP outperforms other existing state-of-the-art methods. The ablation experimental results on the DawnNet network indicate that introducing MLP into DGCN can offset the performance degradation of DGCN, and jointing MLP and DGCN can effectively improve the performance of identifying cancer driver genes. DGMP can identify not only the highly mutated cancer driver genes but also the driver genes harboring other kinds of alterations (e.g., differential expression and aberrant DNA methylation) or genes involved in GRNs with other cancer genes. The source code of DGMP can be freely downloaded from https://github.com/NWPU-903PR/DGMP.
10. scEMAIL: Universal and Source-free Annotation Method for scRNA-seq Data with Novel Cell-type Perception
Hui Wan, Liang Chen, Minghua Deng
Current cell-type annotation tools for single-cell RNA sequencing (scRNA-seq) data mainly utilize well-annotated source data to help identify cell types in target data. However, on account of privacy preservation, their requirements for raw source data may not always be satisfied. In this case, achieving feature alignment between source and target data explicitly is impossible. Additionally, these methods are barely able to discover the presence of novel cell types. A subjective threshold is often selected by users to detect novel cells. We propose a universal annotation framework for scRNA-seq data called scEMAIL, which automatically detects novel cell types without accessing source data during adaptation. For new cell-type identification, a novel cell-type perception module is designed with three steps. First, an expert ensemble system measures uncertainty of each cell from three complementary aspects. Second, based on this measurement, bimodality tests are applied to detect the presence of new cell types. Third, once assured of their presence, an adaptive threshold via manifold mixup partitions target cells into “known” and “unknown” groups. Model adaptation is then conducted to alleviate the batch effect. We gather multi-order neighborhood messages globally and impose local affinity regularizations on “known” cells. These constraints mitigate wrong classifications of the source model via reliable self-supervised information of neighbors. scEMAIL is accurate and robust under various scenarios in both simulation and real data. It is also flexible to be applied to challenging single-cell ATAC-seq data without loss of superiority. The source code of scEMAIL can be accessed at https://github.com/aster-ww/scEMAIL and https://ngdc.cncb.ac.cn/biocode/tools/BT007335/releases/v1.0.
11. Annotating TSSs in Multiple Cell Types Based on DNA Sequence and RNA-seq Data via DeeReCT-TSS
Juexiao Zhou, Bin Zhang, Haoyang Li, Longxi Zhou, Zhongxiao Li, Yongkang Long, Wenkai Han, Mengran Wang, Huanhuan Cui, Jingjing Li, Wei Chen, Xin Gao
The accurate annotation of transcription start sites (TSSs) and their usage are critical for the mechanistic understanding of gene regulation in different biological contexts. To fulfill this, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner, and various computational tools have also been developed for in silico prediction of TSSs solely based on genomic sequences. Most of these computational tools cast the problem as a binary classification task on a balanced dataset, thus resulting in drastic false positive predictions when applied on the genome scale. Here, we present DeeReCT-TSS, a deep learning-based method that is capable of identifying TSSs across the whole genome based on both DNA sequence and conventional RNA sequencing data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we develop a meta-learning-based extension for simultaneous TSS annotations on 10 cell types, which enables the identification of cell type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets by correlating our predicted TSSs with experimentally defined TSS chromatin states. The source code for DeeReCT-TSS is available at https://github.com/JoshuaChou2018/DeeReCT-TSS_release and https://ngdc.cncb.ac.cn/biocode/tools/BT007316.
12. TIST: Transcriptome and Histopathological Image Integrative Analysis for Spatial Transcriptomics
Yiran Shan, Qian Zhang, Wenbo Guo, Yanhong Wu, Yuxin Miao, Hongyi Xin, Qiuyu Lian, Jin Gu
Sequencing-based spatial transcriptomics (ST) is an emerging technology to study in situ gene expression patterns at the whole-genome scale. Currently, ST data analysis is still complicated by high technical noises and low resolution. In addition to the transcriptomic data, matched histopathological images are usually generated for the same tissue sample along the ST experiment. The matched high-resolution histopathological images provide complementary cellular phenotypical information, providing an opportunity to mitigate the noises in ST data. We present a novel ST data analysis method called transcriptome and histopathological image integrative analysis for ST (TIST), which enables the identification of spatial clusters (SCs) and the enhancement of spatial gene expression patterns by integrative analysis of matched transcriptomic data and images. TIST devises a histopathological feature extraction method based on Markov random field (MRF) to learn the cellular features from histopathological images, and integrates them with the transcriptomic data and location information as a network, termed TIST-net. Based on TIST-net, SCs are identified by a random walk-based strategy, and gene expression patterns are enhanced by neighborhood smoothing. We benchmark TIST on both simulated datasets and 32 real samples against several state-of-the-art methods. Results show that TIST is robust to technical noises on multiple analysis tasks for sequencing-based ST data and can find interesting microstructures in different biological scenarios. TIST is available at http://lifeome.net/software/tist/ and https://ngdc.cncb.ac.cn/biocode/tools/BT007317.
13. DeepNoise: Signal and Noise Disentanglement Based on Classifying Fluorescent Microscopy Images via Deep Learning
Sen Yang, Tao Shen, Yuqi Fang, Xiyue Wang, Jun Zhang, Wei Yang, Junzhou Huang, Xiao Han
The high-content image-based assay is commonly leveraged for identifying the phenotypic impact of genetic perturbations in biology field. However, a persistent issue remains unsolved during experiments: the interferential technical noises caused by systematic errors (e.g., temperature, reagent concentration, and well location) are always mixed up with the real biological signals, leading to misinterpretation of any conclusion drawn. Here, we reported a mean teacher-based deep learning model (DeepNoise) that can disentangle biological signals from the experimental noises. Specifically, we aimed to classify the phenotypic impact of 1108 different genetic perturbations screened from 125,510 fluorescent microscopy images, which were totally unrecognizable by the human eye. We validated our model by participating in the Recursion Cellular Image Classification Challenge, and DeepNoise achieved an extremely high classification score (accuracy: 99.596%), ranking the 2nd place among 866 participating groups. This promising result indicates the successful separation of biological and technical factors, which might help decrease the cost of treatment development and expedite the drug discovery process. The source code of DeepNoise is available at https://github.com/Scu-sen/Recursion-Cellular-Image-Classification-Challenge.
14. NetBCE: An Interpretable Deep Neural Network for Accurate Prediction of Linear B-cell Epitopes
Haodong Xu, Zhongming Zhao
Identification of B-cell epitopes (BCEs) plays an essential role in the development of peptide vaccines and immuno-diagnostic reagents, as well as antibody design and production. In this work, we generated a large benchmark dataset comprising 124,879 experimentally supported linear epitope-containing regions in 3567 protein clusters from over 1.3 million B cell assays. Analysis of this curated dataset showed large pathogen diversity covering 176 different families. The accuracy in linear BCE prediction was found to strongly vary with different features, while all sequence-derived and structural features were informative. To search more efficient and interpretive feature representations, a ten-layer deep learning framework for linear BCE prediction, namely NetBCE, was developed. NetBCE achieved high accuracy and robust performance with the average area under the curve (AUC) value of 0.8455 in five-fold cross-validation through automatically learning the informative classification features. NetBCE substantially outperformed the conventional machine learning algorithms and other tools, with more than 22.06% improvement of AUC value compared to other tools using an independent dataset. Through investigating the output of important network modules in NetBCE, epitopes and non-epitopes tended to be presented in distinct regions with efficient feature representation along the network layer hierarchy. The NetBCE is freely available at https://github.com/bsml320/NetBCE.
15. TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction
Yi-Heng Zhu, Chengxin Zhang, Yan Liu, Gilbert S. Omenn, Peter L. Freddolino, Dong-Jun Yu, Yang Zhang
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
16. DrSim: Similarity Learning for Transcriptional Phenotypic Drug Discovery
Zhiting Wei, Sheng Zhu, Xiaohan Chen, Chenyu Zhu, Bin Duan, Qi Liu
Transcriptional phenotypic drug discovery has achieved great success, and various compound perturbation-based data resources, such as connectivity map (CMap) and library of integrated network-based cellular signatures (LINCS), have been presented. Computational strategies fully mining these resources for phenotypic drug discovery have been proposed. Among them, the fundamental issue is to define the proper similarity between transcriptional profiles. Traditionally, such similarity has been defined in an unsupervised way. However, due to the high dimensionality and the existence of high noise in high-throughput data, similarity defined in the traditional way lacks robustness and has limited performance. To this end, we present DrSim, which is a learning-based framework that automatically infers similarity rather than defining it. We evaluated DrSim on publicly available in vitro and in vivo datasets in drug annotation and repositioning. The results indicated that DrSim outperforms the existing methods. In conclusion, by learning transcriptional similarity, DrSim facilitates the broad utility of high-throughput transcriptional perturbation data for phenotypic drug discovery. The source code and manual of DrSim are available at https://github.com/bm2-lab/DrSim/.