Articles Online (Volume 18, Issue 5)

Original Research

Ambient Temperature is A Strong Selective Factor Influencing Human Development and Immunity

Lindan Ji, Dongdong Wu, Haibing Xie, Binbin Yao, Yanming Chen, David M. Irwin, Dan Huang, Jin Xu, Nelson L.S. Tang, Yaping Zhang

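Solar energy, which is essential for the origin and evolution of all life forms on Earth, can be objectively recorded through attributes such as climatic ambient temperature (CAT), ultraviolet radiation (UVR), and sunlight duration (SD). These attributes have specific geographical variations and may cause different adaptation traits. However, the adaptation profile of each attribute and the selective role of solar energy as a whole during human evolution remain elusive. Here, we performed a genome-wide adaptation study with respect to CAT, UVR, and SD using the Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain (HGDP-CEPH) panel data. We singled out CAT as the most important driving force with the highest number of adaptive loci (6 SNPs at the genome-wide 1 × 10⁻⁷ level; 401 at the suggestive 1 × 10⁻⁵ level). Five of the six genome-wide significant adaptation SNPs were successfully replicated in an independent Chinese population (N = 1395). The corresponding 316 CAT adaptation genes were mostly involved in development and immunity. In addition, 265 (84%) genes were related to at least one genome-wide association study (GWAS)-mapped human trait, being significantly enriched in anthropometric loci such as those associated with body mass index (χ²; P < 0.005), immunity, metabolic syndrome, and cancer (χ²; P < 0.05). For these adaptive SNPs, balancing selection was evident in Euro-Asians, whereas obvious positive and/or purifying selection was observed in Africans. Taken together, our study indicates that CAT is the most important attribute of solar energy that has driven genetic adaptation in development and immunity among global human populations. It also supports the non-neutral hypothesis for the origin of disease-predisposition alleles in common diseases.
Chinese abstract (translated): Solar energy, which is vital to the origin and evolution of life on Earth, can be measured through three main attributes: ambient temperature (CAT), ultraviolet radiation (UVR), and sunlight duration (SD). These attributes combine differently across regions and may lead to different adaptive traits. How human populations have adapted to solar energy as represented by these three attributes, and in particular to each attribute individually, has not been fully elucidated. In this study, we used a genome-wide adaptation analysis to screen the Human Genome Diversity Project (HGDP-CEPH) database for single-nucleotide polymorphisms (SNPs) that may have been selected by temperature, UVR, and SD. Temperature had the largest number of significantly associated loci (6 SNPs at the genome-wide 1 × 10⁻⁷ level and 401 SNPs at the 1 × 10⁻⁵ level), suggesting that it may be the dominant factor among the three attributes. Of the six genome-wide significant temperature signals, five were validated in an independent Chinese population (N = 1395). The 316 temperature-associated genes to which the 401 SNPs map are mainly related to development and immunity. Of these, 265 genes (84%) are related to at least one human trait mapped by genome-wide association studies (GWAS) and are significantly enriched in categories such as anthropometric measures (for example, body mass index), immunity, metabolic syndrome, and cancer. Overall, these temperature adaptation signals show balancing selection in Euro-Asian populations and purifying selection in Africans. In summary, this study suggests that temperature may have played the leading role in the selective effect of solar energy on development and immune function in global human populations. The results also support the non-neutral hypothesis for the origin of susceptibility genes in common complex diseases.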

Page 489-500

Original Research

Landscape and Dynamics of the Transcriptional Regulatory Network During Natural Killer Cell Differentiation

Kun Li, Yang Wu, Young Li, Qiaoni Yu, Zhigang Tian, Haiming Wei, Kun Qu

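Natural killer (NK) cells are essential in controlling cancer and infection. However, little is known about the dynamics of the transcriptional regulatory machinery during NK cell differentiation. In this study, we applied the assay for transposase-accessible chromatin with sequencing (ATAC-seq) technique in an in-house in vitro NK cell differentiation system. Analysis of ATAC-seq data illustrated two distinct transcription factor (TF) clusters that dynamically regulate NK cell differentiation. Moreover, two TFs from the second cluster, FOS-like 2 (FOSL2) and early growth response 2 (EGR2), were identified as novel essential TFs that control NK cell maturation and function. Knocking down either of these two TFs significantly impacted NK cell differentiation. Finally, we constructed a genome-wide transcriptional regulatory network that provides a better understanding of the regulatory dynamics during NK cell differentiation.
Chinese abstract (translated): Natural killer (NK) cells are innate lymphoid cells that protect the host against infection and cancer cells. NK cell-based immunotherapy has become an emerging force in cancer treatment and is expected to play an important role in future disease therapy, but it depends on large numbers of NK cells with optimal activity. A comprehensive understanding of NK cell differentiation is therefore especially important for improving the effectiveness of clinical treatment. In this study, we used ATAC-seq in an in vitro induced NK cell differentiation system to measure changes in chromatin accessibility during NK cell differentiation. Analysis of the ATAC-seq data revealed two distinct transcription factor (TF) clusters that dynamically regulate NK cell differentiation. In addition, two TFs from the second cluster, FOSL2 and EGR2, were identified as novel essential TFs regulating NK cell maturation and function. Knocking down either of these two TFs markedly affected NK cell differentiation. Finally, we constructed a genome-wide transcriptional regulatory network that gives a comprehensive view of NK cell differentiation.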

Page 501-515

Original Research

The Biological Significance of Multi-copy Regions and Their Impact on Variant Discovery

Jing Sun, Yanfang Zhang, Minhui Wang, Qian Guan, Xiujia Yang, Jin Xia Ou, Mingchen Yan, Chengrui Wang, Yan Zhang, Zhi-Hao Li, Chunhong Lan, Chen Mao, Hong-Wei Zhou, Bingtao Hao, Zhenhai Zhang

Identification of genetic variants via high-throughput sequencing (HTS) technologies has been essential for both fundamental and clinical studies. However, to what extent the genome sequence composition affects variant calling remains unclear. In this study, we identified 63,897 multi-copy sequences (MCSs) with a minimum length of 300 bp, each of which occurs at least twice in the human genome. The 151,749 genomic loci (multi-copy regions, or MCRs) harboring these MCSs account for 1.98% of the genome and are distributed unevenly across chromosomes. MCRs containing the same MCS tend to be located on the same chromosome. Gene Ontology (GO) analyses revealed that 3800 genes whose UTRs or exons overlap with MCRs are enriched for Golgi-related cellular component terms and various enzymatic activities in the GO biological function category. MCRs are also enriched for loci that are sensitive to neocarzinostatin-induced double-strand breaks. Moreover, genetic variants discovered by genome-wide association studies and recorded in dbSNP are significantly underrepresented in MCRs. Using simulated HTS datasets, we show that false variant discovery rates are significantly higher in MCRs than in other genomic regions. These results suggest that extra caution must be taken when identifying genetic variants in the MCRs via HTS technologies.
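As a toy illustration of the multi-copy idea in this abstract, the sketch below reports every fixed-length window that occurs at least twice in a sequence and the fraction of bases such windows cover. It is not the authors' pipeline: the function names, the fixed window length `k`, and the exact-match, single-strand criterion are assumptions made for illustration (the study itself uses a 300 bp minimum length on the human genome).

```python
from collections import defaultdict


def find_multicopy_kmers(sequence: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer that occurs at least twice to its 0-based start positions."""
    positions: dict[str, list[int]] = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    # Keep only windows seen two or more times (hypothetical "multi-copy" criterion).
    return {kmer: starts for kmer, starts in positions.items() if len(starts) >= 2}


def covered_fraction(sequence: str, k: int) -> float:
    """Fraction of bases that lie inside at least one multi-copy window."""
    covered: set[int] = set()
    for starts in find_multicopy_kmers(sequence, k).values():
        for start in starts:
            covered.update(range(start, start + k))
    return len(covered) / len(sequence) if sequence else 0.0
```

On a real genome a hash of all windows would be too memory-hungry; an alignment- or index-based approach would be the more likely choice, so treat this purely as a statement of the definition.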

Page 516-524

Original Research

Kinase–substrate Edge Biomarkers Provide A More Accurate Prognostic Prediction in ER-negative Breast Cancer

Yidi Sun, Chen Li, Shichao Pang, Qianlan Yao, Luonan Chen, Yixue Li, Rong Zeng

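The estrogen receptor (ER)-negative breast cancer subtype is aggressive with few treatment options available. To identify specific prognostic factors for ER-negative breast cancer, this study included 705,729 patients with invasive breast cancer from the Surveillance, Epidemiology, and End Results (SEER) database and 1034 from The Cancer Genome Atlas (TCGA) database. To identify key differential kinase–substrate node and edge biomarkers between ER-negative and ER-positive breast cancer patients, we adopted a network-based method using correlation coefficients between molecular pairs in the kinase regulatory network. Integrated analysis of the clinical and molecular data revealed the significant prognostic power of kinase–substrate node and edge features for both subtypes of breast cancer. Two promising kinase–substrate edge features, CSNK1A1–NFATC3 and SRC–OCLN, were identified for more accurate prognostic prediction in ER-negative breast cancer patients.
Chinese summary (translated): Research question: Novel drug targets for ER-negative breast cancer. Methods: RNA-seq datasets of 705,704 breast cancer patients from SEER and 1034 from TCGA were analyzed. Kinase–substrate edge strengths were constructed from expression levels, LASSO regression was used to select kinase–substrate features specific to the ER-positive and ER-negative subtypes, and a random forest approach was used to clarify how kinase–substrate node and edge-strength features affect prognostic prediction for the two subtypes. Key result 1: In both the SEER and TCGA datasets, ER-negative breast cancer showed worse survival than ER-positive breast cancer. Key result 2: Based on the correlation between each kinase and substrate pair, the "node" features of kinase and substrate expression were converted into kinase–substrate edge strengths. LASSO regression selected kinase–substrate features specific to the ER-positive and ER-negative subtypes. Key result 3: Kinase–substrate features discriminate prognosis in breast cancer very well, and significantly better than known breast cancer prognostic markers. Key result 4: Two kinase–substrate edges specific to the ER-negative subtype, CSNK1A1–NFATC3 and SRC–OCLN, offer new ideas for treating ER-negative breast cancer patients.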
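One way to picture how a pairwise correlation becomes a per-patient "edge" feature is sketched below. It rests on an assumption that this listing does not confirm: that the edge strength for one sample is the product of the z-scored kinase and substrate expression values, whose mean over all samples equals the Pearson correlation coefficient. The function names are hypothetical.

```python
from statistics import fmean, pstdev


def z_scores(values: list[float]) -> list[float]:
    """Standardize values using the population standard deviation."""
    mean = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("expression values are constant; z-scores are undefined")
    return [(v - mean) / sd for v in values]


def edge_strengths(kinase: list[float], substrate: list[float]) -> list[float]:
    """Per-sample edge strength: product of the two z-scores (assumed definition)."""
    if len(kinase) != len(substrate):
        raise ValueError("kinase and substrate must cover the same samples")
    return [zk * zs for zk, zs in zip(z_scores(kinase), z_scores(substrate))]


def pearson_from_edges(kinase: list[float], substrate: list[float]) -> float:
    """Averaging the per-sample edge strengths recovers the Pearson correlation."""
    return fmean(edge_strengths(kinase, substrate))
```

Under this assumed definition, each patient gets one value per kinase–substrate pair, and those values could then be fed to a feature selector or survival model in place of, or alongside, the node-level expression values.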

Page 525-538

Original Research

Pooled Plasmid Sequencing Reveals the Relationship Between Mobile Genetic Elements and Antimicrobial Resistance Genes in Clinically Isolated Klebsiella pneumoniae

Yan Jiang, Yanfei Wang, Xiaoting Hua, Yue Qu, Anton Y. Peleg, Yunsong Yu

Plasmids remain important microbial components mediating the horizontal gene transfer (HGT) and dissemination of antimicrobial resistance. To systematically explore the relationship between mobile genetic elements (MGEs) and antimicrobial resistance genes (ARGs), a novel strategy using single-molecule real-time (SMRT) sequencing was developed. This approach was applied to pooled conjugative plasmids from clinically isolated multidrug-resistant (MDR) Klebsiella pneumoniae from a tertiary referral hospital over a 9-month period. The conjugative plasmid pool was obtained from transconjugants that acquired antimicrobial resistance after plasmid conjugation with 53 clinical isolates. The plasmid pool was then subjected to SMRT sequencing, and 82 assembled plasmid fragments were obtained. In total, 124 ARGs (responsible for resistance to β-lactam, fluoroquinolone, and aminoglycoside, among others) and 317 MGEs [including transposons (Tns), insertion sequences (ISs), and integrons] were derived from these fragments. Most of these ARGs were linked to MGEs, allowing for the establishment of a relationship network between MGEs and/or ARGs that can be used to describe the dissemination of resistance by mobile elements. Key elements involved in resistance transposition were identified, including IS26, Tn3, IS903B, ISEcp1, and ISKpn19. For IS26, the most predominant IS in the network, a typical multicopy composite transposition event was illustrated by tracing its flanking 8-bp target site duplications (TSDs). The landscape of the pooled plasmid sequences highlights the diversity and complexity of the relationship between MGEs and ARGs, underpinning the clinical value of dominant HGT profiles.
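Tracing target site duplications comes down to checking whether the short sequences on either side of an inserted element are identical direct repeats. The sketch below shows that check for a single annotated element. The function name, the 0-based half-open coordinates, and the exact-match requirement are assumptions for illustration; only the 8-bp length comes from the abstract.

```python
from typing import Optional


def flanking_tsd(sequence: str, start: int, end: int, tsd_len: int = 8) -> Optional[str]:
    """Return the duplicated flank if the element at [start, end) has identical
    direct repeats of length tsd_len immediately on both sides, else None."""
    if start < tsd_len or end + tsd_len > len(sequence):
        return None  # not enough flanking sequence to compare
    upstream = sequence[start - tsd_len:start]
    downstream = sequence[end:end + tsd_len]
    return upstream if upstream == downstream else None
```

For a composite transposon, the same check would be applied to the outer boundaries of the whole unit, not to each IS copy separately.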

Page 539-548

Original Research

Exploring Potential Signals of Selection for Disordered Residues in Prokaryotic and Eukaryotic Proteins

Arup Panda, Tamir Tuller

Intrinsically disordered proteins (IDPs) are a functionally important class of proteins in all domains of life. However, how nature has shaped the disorder potential of prokaryotic and eukaryotic proteins is still not clearly known. Randomly generated sequences are free of any selective constraints, so they are commonly used as null models. Considering different types of random protein models, here we seek to understand how the disorder potential of natural eukaryotic and prokaryotic proteins differs from that of random sequences. Comparing proteome-wide disorder content between real and random sequences of 12 model organisms, we noticed that eukaryotic proteins are enriched in disordered regions compared to random sequences, but in prokaryotes such regions are depleted. By analyzing the position-wise disorder profile, we show that disorder is generally higher near the N- and C-terminal regions of eukaryotic proteins than in the random models; in prokaryotic proteins, however, this trend was weak or absent. Moreover, we show that this preference is not caused by the amino acid or nucleotide composition at the respective sites. Instead, these regions were found to be endowed with a higher fraction of protein–protein binding sites, suggesting their functional importance. We discuss several possible explanations for this pattern, such as improving the efficiency of protein–protein interaction, ribosome movement during translation, and post-translational modification. However, further studies are needed to clearly understand the biophysical mechanisms causing the trend.

Page 549-564


PIMD: An Integrative Approach for Drug Repositioning Using Multiple Characterization Fusion

Song He, Yuqi Wen, Xiaoxi Yang, Zhen Liu, Xinyu Song, Xin Huang, Xiaochen Bo

The accumulation of various types of drug informatics data and computational approaches for drug repositioning can accelerate pharmaceutical research and development. However, the integration of multi-dimensional drug data for precision repositioning remains a pressing challenge. Here, we propose a systematic framework named PIMD to predict drug therapeutic properties by integrating multi-dimensional data for drug repositioning. In PIMD, drug similarity networks (DSNs) based on chemical, pharmacological, and clinical data are fused into an integrated DSN (iDSN) composed of many clusters. Rather than simple fusion, PIMD offers a systematic way to annotate clusters. Unexpected drugs within clusters and drug pairs with a high iDSN similarity score are therefore identified to predict novel therapeutic uses. PIMD provides new insights into the universality, individuality, and complementarity of different drug properties by evaluating the contribution of each property data. To test the performance of PIMD, we use chemical, pharmacological, and clinical properties to generate an iDSN. Analyses of the contributions of each drug property indicate that this iDSN was driven by all data types and performs better than other DSNs. Within the top 20 recommended drug pairs, 7 drugs have been reported to be repurposed. The source code for PIMD is available at
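Chinese summary (translated): Research question: Drug repositioning through computational strategies. Solution: A multi-omics data fusion framework, PIMD. Implementation: Multiple drug similarity networks based on chemical, pharmacological, and clinical property data are fused into one integrated drug similarity network, and a systematic way to annotate drug communities is provided. Source code: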

Page 565-581


GTB-PPI: Predict Protein–protein Interactions Based on L1-regularized Logistic Regression and Gradient Tree Boosting

Bin Yu, Cheng Chen, Hongyan Zhou, Bingqiang Liu, Qin Ma

Protein–protein interactions (PPIs) are of great importance for understanding genetic mechanisms, delineating disease pathogenesis, and guiding drug design. With the increase of PPI data and the development of machine learning technologies, prediction and identification of PPIs have become a research hotspot in proteomics. In this study, we propose a new prediction pipeline for PPIs based on gradient tree boosting (GTB). First, the initial feature vector is extracted by fusing pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), reduced sequence and index-vectors (RSIV), and autocorrelation descriptor (AD). Second, to remove redundancy and noise, we employ L1-regularized logistic regression (L1-RLR) to select an optimal feature subset. Finally, the GTB-PPI model is constructed. Five-fold cross-validation showed that GTB-PPI achieved accuracies of 95.15% and 90.47% on the Saccharomyces cerevisiae and Helicobacter pylori datasets, respectively. In addition, GTB-PPI could be applied to predict the independent test datasets for Caenorhabditis elegans, Escherichia coli, Homo sapiens, and Mus musculus, the one-core PPI network for CD9, and the crossover PPI network for the Wnt-related signaling pathways. The results show that GTB-PPI can significantly improve the accuracy of PPI prediction. The code and datasets of GTB-PPI can be downloaded from

Page 582-592


iLBE for Computational Identification of Linear B-cell Epitopes by Integrating Sequence and Evolutionary Features

Md. Mehedi Hasan, Mst. Shamima Khatun, Hiroyuki Kurata

Linear B-cell epitopes are critically important for immunological applications, such as vaccine design, immunodiagnostic tests, and antibody production, as well as disease diagnosis and therapy. The accurate identification of linear B-cell epitopes remains challenging despite several decades of research. In this work, we have developed a novel predictor, Identification of Linear B-cell Epitope (iLBE), by integrating evolutionary and sequence-based features. The successive feature vectors were optimized by a Wilcoxon rank-sum test. Then the random forest (RF) algorithm using the optimal consecutive feature vectors was applied to predict linear B-cell epitopes. We combined the RF scores by logistic regression to enhance the prediction accuracy. iLBE yielded an area under curve score of 0.809 on the training dataset and outperformed other prediction models on a comprehensive independent dataset. iLBE is a powerful computational tool to identify linear B-cell epitopes and would help to develop diagnostic tests. A web application with curated datasets for iLBE is freely accessible at
Japanese abstract (translated): Linear B-cell epitopes are very important for immunological applications such as vaccine design, immunodiagnostic tests, antibody production, and disease diagnosis and therapy. Accurate identification of linear B-cell epitopes remains a challenge despite decades of research. In this study, we developed a novel B-cell epitope prediction model (iLBE) by integrating evolutionary, physicochemical, and other sequence features. Feature vectors optimized by the Wilcoxon rank-sum test were used to train a random forest (RF) algorithm, which computes prediction scores for linear B-cell epitopes. The RF scores were combined by logistic regression to improve prediction accuracy. iLBE achieved an AUC of 0.809 on the training dataset and outperformed existing prediction models on an independent test dataset. As a powerful computational tool for identifying linear B-cell epitopes, iLBE is useful for developing diagnostic tests. A web application for the iLBE model with annotated datasets is freely accessible.

Page 593-600

Application Note

SeSaMe: Metagenome Sequence Classification of Arbuscular Mycorrhizal Fungi-associated Microorganisms

Jee Eun Kang, Antonio Ciampi, Mohamed Hijri

Arbuscular mycorrhizal fungi (AMF) are plant root symbionts that play key roles in plant growth and soil fertility. They are obligate biotrophic fungi that form coenocytic multinucleated hyphae and spores. Numerous studies have shown that diverse microorganisms live on the surface of and inside their mycelia, resulting in a metagenome when whole-genome sequencing (WGS) data are obtained from sequencing AMF cultivated in vivo. The metagenome contains not only the AMF sequences, but also those from associated microorganisms. In this study, we introduce a novel bioinformatics program, Spore-associated Symbiotic Microbes (SeSaMe), designed for taxonomic classification of short sequences obtained by next-generation DNA sequencing. A genus-specific usage bias database was created based on amino acid usage and codon usage of a three consecutive codon DNA 9-mer encoding an amino acid trimer in a protein secondary structure. The program distinguishes between coding sequence (CDS) and non-CDS, and classifies a query sequence into a genus group out of 54 genera used as reference. The mean percentages of correct predictions of the CDS and the non-CDS test sets at the genus level were 71% and 50% for bacteria, 68% and 73% for fungi (excluding AMF), and 49% and 72% for AMF (Rhizophagus irregularis), respectively. SeSaMe provides not only a means for estimating taxonomic diversity and abundance but also the gene reservoir of the reference taxonomic groups associated with AMF. Therefore, it enables users to study the symbiotic roles of associated microorganisms. It is also applicable to other microorganisms as well as soil metagenomes. SeSaMe is freely available at
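The "three consecutive codon DNA 9-mer encoding an amino acid trimer" can be made concrete with a small counter that slides over a coding sequence one codon at a time and tallies which 9-mers encode each trimer. This is only a sketch of that unit of analysis, not SeSaMe itself: it uses the standard genetic code, ignores protein secondary structure and genus-specific databases, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def trimer_codon_usage(cds: str) -> dict[str, Counter]:
    """For each amino acid trimer, count the in-frame DNA 9-mers that encode it."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    usage: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(codons) - 2):
        window = codons[i:i + 3]
        if any(codon not in CODON_TABLE for codon in window):
            continue  # skip windows containing ambiguous bases
        trimer = "".join(CODON_TABLE[codon] for codon in window)
        if "*" in trimer:
            continue  # skip windows spanning a stop codon
        usage[trimer]["".join(window)] += 1
    return dict(usage)
```

Comparing such counts across reference genera gives a usage-bias profile per trimer, which is the kind of signal a genus-level classifier could draw on.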

Page 601-612

Application Note

SeSaMe PS Function: Functional Analysis of the Whole Metagenome Sequencing Data of the Arbuscular Mycorrhizal Fungi

Jee Eun Kang, Antonio Ciampi, Mohamed Hijri

In this study, we introduce a novel bioinformatics program, Spore-associated Symbiotic Microbes Position-specific Function (SeSaMe PS Function), for position-specific functional analysis of short sequences derived from metagenome sequencing data of the arbuscular mycorrhizal fungi. The unique advantage of the program lies in databases created based on genus-specific sequence properties derived from protein secondary structure, namely amino acid usages, codon usages, and codon contexts of 3-codon DNA 9-mers. SeSaMe PS Function searches a query sequence against a reference sequence database, identifies 3-codon DNA 9-mers with structural roles, and creates a comparative dataset containing the codon usage biases of the 3-codon DNA 9-mers from 54 bacterial and fungal genera. The program applies correlation principal component analysis in conjunction with the K-means clustering method to the comparative dataset. 3-codon DNA 9-mers clustered as a sole member or with only a few members are often structurally and functionally distinctive sites that provide useful insights into important molecular interactions. The program provides a versatile means for studying functions of short sequences from metagenome sequencing and has a wide spectrum of applications. SeSaMe PS Function is freely accessible at

Page 613-623


Corrigendum to “Antibiotic Treatment Drives the Diversification of the Human Gut Resistome” [Genomics Proteomics Bioinformatics 17 (1) (2019) 39–51]

Jun Li, Elizabeth A. Rettedal, Eric van der Helm, Mostafa Ellabaan, Gianni Panagiotou, Morten O.A. Sommer

Page 624-625