Articles Online (Volume 18, Issue 1)

Perspective

The Elements of Data Sharing

Zhang Zhang, Shuhui Song, Jun Yu, Wenming Zhao, Jingfa Xiao, Yiming Bao

Page 1-4


Perspective

The Birth of Bio-data Science: Trends, Expectations, and Applications

Wilson Wen Bin Goh, Limsoon Wong

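The rise of biological big data, machine learning, and artificial intelligence will change how research is conducted in biology. Against this background, bio-data science (BDS) will emerge as a new discipline. We discuss BDS in terms of its relationship to related disciplines such as computational biology and bioinformatics; its trends and future; shifts in training, education, and career directions; and beneficial applications of BDS in practical analysis.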

Page 5-15


Review

Role of Long Non-coding RNAs in Reprogramming to Induced Pluripotency

Shahzina Kanwal, Xiangpeng Guo, Carl Ward, Giacomo Volpe, Baoming Qin, Miguel A. Esteban, Xichen Bao

The generation of induced pluripotent stem cells through somatic cell reprogramming requires a global reorganization of cellular functions. This reorganization occurs in a multi-phased manner and involves a gradual revision of both the epigenome and transcriptome. Recent studies have shown that the large-scale transcriptional changes observed during reprogramming also apply to long non-coding RNAs (lncRNAs), a type of traditionally neglected RNA species that are increasingly viewed as critical regulators of cellular function. Deeper understanding of lncRNAs in reprogramming may not only help to improve this process but also have implications for studying cell plasticity in other contexts, such as development, aging, and cancer. In this review, we summarize the current progress made in profiling and analyzing the role of lncRNAs in various phases of somatic cell reprogramming, with emphasis on the re-establishment of the pluripotency gene network and X chromosome reactivation.
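Somatic cell reprogramming is the process by which somatic cells are converted into induced pluripotent stem cells. During this process, cellular functions change dramatically, accompanied by a gradual remodeling of the epigenome and transcriptome. Recent studies have shown that long non-coding RNAs (lncRNAs), a long-neglected class of non-coding RNA, undergo large-scale changes at the transcriptional level during reprogramming. Recent work has also revealed that lncRNAs are important regulators of cellular function. A deeper understanding of the functions and regulatory roles of lncRNAs in reprogramming will therefore not only help to improve the efficiency and quality of reprogramming, but also inform studies of cell plasticity in other systems, such as development, aging, and cancer. This review summarizes the important progress reported on the functions of non-coding RNAs in different phases of somatic cell reprogramming, and highlights the re-establishment of the pluripotency gene network and X chromosome reactivation during reprogramming.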

Page 16-25


Original Research

CRISPR Screens Identify Essential Cell Growth Mediators in BRAF Inhibitor-resistant Melanoma

Ziyi Li, Binbin Wang, Shengqing Gu, Peng Jiang, Avinash Sahu, Chen-Hao Chen, Tong Han, Sailing Shi, Xiaoqing Wang, Nicole Traugh, Hailing Liu, Yin Liu, Qiu Wu, Myles Brown, Tengfei Xiao, Genevieve M. Boland, X. Shirley Liu

BRAF is a serine/threonine kinase that harbors activating mutations in ∼7% of human malignancies and ∼60% of melanomas. Despite initial clinical responses to BRAF inhibitors, patients frequently develop drug resistance. To identify candidate therapeutic targets for BRAF inhibitor-resistant melanoma, we conduct CRISPR screens in melanoma cells that harbor an activating BRAF mutation and have also acquired resistance to BRAF inhibitors. To investigate the mechanisms and pathways enabling resistance to BRAF inhibitors in melanomas, we integrate expression, ATAC-seq, and CRISPR screen data. We identify the JUN family transcription factors and the ETS family transcription factor ETV5 as key regulators of CDK6, which together enable resistance to BRAF inhibitors in melanoma cells. Our findings reveal genes contributing to resistance to the selective BRAF inhibitor PLX4720, providing new insights into gene regulation in BRAF inhibitor-resistant melanoma cells.
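Melanoma is a malignant tumor characterized by poor prognosis and low survival. In nearly half of patients with metastatic melanoma, the serine/threonine kinase BRAF carries a mutation at codon 600, most commonly V600E or V600K. The high frequency of BRAF mutations in metastatic melanoma has spurred the research and development of small-molecule drugs targeting mutant BRAF. Early clinical trials showed that small-molecule BRAF inhibitors hold great promise as a treatment for melanoma patients with the BRAF V600E mutation. Although cancer patients with BRAF mutations respond well to small-molecule BRAF inhibitors at the start of treatment, some of them develop acquired resistance after a period of treatment, which leads to relapse. Systematic study of the mechanisms of resistance to small-molecule BRAF inhibitors is therefore essential for finding effective treatment strategies and addressing cancer drug resistance. Studying these mechanisms will also help to elucidate the signaling pathways involved as melanoma develops resistance, providing important information for circumventing resistance and improving drug efficacy. To systematically investigate the mechanisms of BRAF inhibitor resistance in melanoma, we performed a series of experiments in cell lines carrying the BRAF V600E mutation. After long-term drug treatment, these melanoma cell lines acquired resistance to the small-molecule BRAF inhibitor PLX4032. We then used CRISPR screens to search for signaling pathways and genes associated with BRAF inhibitor resistance. In addition, by integrating analyses of transcriptome and epigenetic profiling data, we revealed the gene regulatory network in BRAF inhibitor-resistant cells.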

Page 26-40


Original Research

Epitranscriptomic 5-Methylcytosine Profile in PM2.5-induced Mouse Pulmonary Fibrosis

Xiao Han, Hanchen Liu, Zezhong Zhang, Wenlan Yang, Chunyan Wu, Xueying Liu, Fang Zhang, Baofa Sun, Yongliang Zhao, Guibin Jiang, Yun-Gui Yang, Wenjun Ding

Exposure to airborne particulate matter (PM) with an aerodynamic diameter of less than 2.5 μm (PM2.5) is epidemiologically associated with lung dysfunction and respiratory symptoms, including pulmonary fibrosis. However, whether epigenetic mechanisms are involved in PM2.5-induced pulmonary fibrosis is currently poorly understood. Herein, using a PM2.5-induced pulmonary fibrosis mouse model, we found that PM2.5 exposure leads to aberrant mRNA 5-methylcytosine (m5C) gain and loss in fibrotic lung tissues. Moreover, we showed the m5C-mediated regulatory map of gene functions in pulmonary fibrosis after PM2.5 exposure. Several genes act as m5C gain-upregulated factors and are probably critical for the development of PM2.5-induced fibrosis in mouse lungs. These genes, including Lcn2, Mmp9, Chi3l1, Adipoq, Atp5j2, Atp5l, Atpif1, Ndufb6, Fgr, Slc11a1, and Tyrobp, are highly related to oxidative stress response, inflammatory responses, and immune system processes. Our study illustrates the first epitranscriptomic RNA m5C profile in PM2.5-induced pulmonary fibrosis and will be valuable in identifying biomarkers for PM2.5 exposure-related lung pathogenesis with translational potential.
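PM2.5 is airborne particulate matter with an aerodynamic diameter of less than 2.5 μm, and it is associated with lung dysfunction and respiratory diseases, including pulmonary fibrosis. However, the epigenetic mechanisms of PM2.5-induced pulmonary fibrosis remain unclear. Here, using a PM2.5-induced mouse model of pulmonary fibrosis together with high-throughput sequencing and bioinformatics analysis, we found that PM2.5 exposure causes aberrant changes in the number of mRNA 5-methylcytosine (m5C) sites and in methylation levels in fibrotic lung tissue. In addition, our analysis presents the m5C-mediated regulatory map of gene functions in pulmonary fibrosis after PM2.5 exposure. Further screening identified upregulated genes with m5C gain that may be critical for the development of PM2.5-induced pulmonary fibrosis in mice. These genes, including Lcn2, Mmp9, Chi3l1, Adipoq, Atp5j2, Atp5l, Atpif1, Ndufb6, Fgr, Slc11a1, and Tyrobp, are highly related to oxidative stress response, inflammatory responses, and immune system processes. Our study charts the first RNA m5C profile in PM2.5-induced pulmonary fibrosis and is of considerable value for developing potential biomarkers for PM2.5 exposure-related lung disease.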

Page 41-51


Original Research

Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information

Fuyi Li, Andre Leier, Quanzhong Liu, Yanan Wang, Dongxu Xiang, Tatsuya Akutsu, Geoffrey I. Webb, A. Ian Smith, Tatiana Marquez-Lago, Jian Li, Jiangning Song

Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.
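Proteases are enzymes that hydrolyze the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled specific proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Solving the substrate identification problem for proteases is therefore important for accurately understanding their functions and physiological roles, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that predict substrate cleavage from both sequence and structural information. In this study, we developed Procleave, a new bioinformatics method for predicting protease-specific substrates and their cleavage sites. The method takes both sequence and 3D structural information into account. It uses a LOWESS data-smoothing optimization method to represent the structural features of known cleavage sites as discrete values, taking the optimal approximation for every structural parameter value, and then encodes these, together with protein sequence and amino acid chemical group features, in a conditional random field (CRF) computational framework. Extensive benchmarking and independent tests show that Procleave performs well and correctly identifies most cleavage sites in the case study. In addition, we applied Procleave to the human structural proteome, comprising 17,628 protein structures, and identified a number of potential novel substrates and their corresponding cleavage sites. The Procleave webserver is freely accessible at http://Procleave.erc.monash.edu/.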

Page 52-64


Method

MSIsensor-pro: Fast, Accurate, and Matched-normal-sample-free Detection of Microsatellite Instability

Peng Jia, Xiaofei Yang, Li Guo, Bowen Liu, Jiadong Lin, Hao Liang, Jianyong Sun, Chengsheng Zhang, Kai Ye

Microsatellite instability (MSI) is a key biomarker for cancer therapy and prognosis. Traditional experimental assays are laborious and time-consuming, and next-generation sequencing-based computational methods do not work on leukemia samples, paraffin-embedded samples, or patient-derived xenografts/organoids, due to the requirement of matched normal samples. Herein, we developed MSIsensor-pro, an open-source single sample MSI scoring method for research and clinical applications. MSIsensor-pro introduces a multinomial distribution model to quantify polymerase slippages for each tumor sample and a discriminative site selection method to enable MSI detection without matched normal samples. We demonstrate that MSIsensor-pro is an ultrafast, accurate, and robust MSI calling method. Using samples with various sequencing depths and tumor purities, MSIsensor-pro significantly outperformed the current leading methods in both accuracy and computational cost. MSIsensor-pro is available at https://github.com/xjtu-omics/msisensor-pro and free for non-commercial use, while a commercial license is provided upon request.
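Microsatellite instability (MSI) is a molecular phenotype of hypermutation at genomic microsatellite regions, caused by impairment of the DNA mismatch repair system in malignant tumors; it occurs most often in colorectal, gastric, and endometrial cancers. MSI is closely associated with tumor onset, progression, and prognosis, and is also a molecular marker for predicting response to immunotherapy. The two gold-standard MSI tests currently used in the clinic are MSI-PCR and MSI-IHC, but both are laborious, time-consuming, and costly. In recent years, with the development of high-throughput sequencing, NGS-based MSI detection methods have begun to emerge; while remaining highly concordant with the two clinical gold standards, they greatly reduce testing time and cost, making wider adoption of MSI testing far more feasible. In 2014, Professor Kai Ye and his team developed the NGS-based MSIsensor which, as the MSI computation method of MSK-IMPACT, the world's first pan-cancer testing panel, passed rigorous testing by the US FDA and was approved. Testing at Memorial Sloan Kettering Cancer Center (MSKCC) showed that the concordance of MSIsensor with the gold standards reaches 99.4%. However, most NGS-based MSI detection algorithms, including MSIsensor, perform poorly at low tumor purity and low sequencing depth. In particular, because these algorithms require a control sample matched to the tumor sample, the scenarios in which MSI testing can be applied are limited; the algorithms are especially difficult to apply to samples for which normal controls are hard to obtain, such as blood cancer specimens, formalin-embedded specimens, and PDX/PDO samples. To address this, the present study draws on the mechanism underlying MSI, uses a multinomial distribution to describe the replication process at genomic microsatellite regions, and quantitatively models DNA polymerase slippage in these regions. Based on this mathematical model of DNA polymerase slippage at microsatellite regions, Kai Ye's team developed MSIsensor-pro in 2019, enabling MSI evaluation of a single tumor sample without a normal control and resolving a technical difficulty of current NGS-based MSI detection tools.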
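The idea of scoring a single tumor sample from polymerase slippage can be illustrated with a deliberately simplified sketch. It is not the MSIsensor-pro model itself: the three-outcome multinomial (shorter, unchanged, longer repeat), the function names, and the `deletion_cutoff` value are all assumptions made for illustration, and the real tool's site selection and parameters are not reproduced here.

```python
def slippage_fractions(length_counts, ref_len):
    """Maximum-likelihood estimate of a three-outcome multinomial at one site.

    length_counts maps an observed repeat length to its read count;
    ref_len is the reference repeat length (both are hypothetical inputs).
    """
    total = sum(length_counts.values())
    if total == 0:
        raise ValueError("no reads cover this microsatellite site")
    deletion = sum(c for n, c in length_counts.items() if n < ref_len) / total
    insertion = sum(c for n, c in length_counts.items() if n > ref_len) / total
    return {
        "deletion": deletion,
        "insertion": insertion,
        "unchanged": 1.0 - deletion - insertion,
    }


def msi_score(sites, deletion_cutoff=0.2):
    """Fraction of sites whose deletion fraction exceeds an assumed cutoff."""
    unstable = sum(
        1
        for length_counts, ref_len in sites
        if slippage_fractions(length_counts, ref_len)["deletion"] > deletion_cutoff
    )
    return unstable / len(sites)


# Two toy sites: the first is mostly stable, the second shows many contractions.
toy_sites = [({10: 90, 9: 10}, 10), ({12: 50, 11: 30, 10: 20}, 12)]
print(msi_score(toy_sites))
```

Because each site is judged against a fixed cutoff rather than against a paired normal sample, a sketch like this needs only the tumor reads, which is the property the abstract emphasizes.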

Page 65-71


Web Server

GPS 5.0: An Update on the Prediction of Kinase-specific Phosphorylation Sites in Proteins

Chenwei Wang, Haodong Xu, Shaofeng Lin, Wankun Deng, Jiaqi Zhou, Ying Zhang, Ying Shi, Di Peng, Yu Xue

In eukaryotes, protein phosphorylation is specifically catalyzed by numerous protein kinases (PKs), faithfully orchestrates various biological processes, and reversibly determines cellular dynamics and plasticity. Here we report an updated algorithm of Group-based Prediction System (GPS) 5.0 to improve the performance for predicting kinase-specific phosphorylation sites (p-sites). Two novel methods, position weight determination (PWD) and scoring matrix optimization (SMO), were developed. Compared with other existing tools, GPS 5.0 exhibits a highly competitive accuracy. Besides serine/threonine or tyrosine kinases, GPS 5.0 also supports the prediction of dual-specificity kinase-specific p-sites. In the classical module of GPS 5.0, 617 individual predictors were constructed for predicting p-sites of 479 human PKs. To extend the application of GPS 5.0, a species-specific module was implemented to predict kinase-specific p-sites for 44,795 PKs in 161 eukaryotes. The online service and local packages of GPS 5.0 are freely available for academic research at http://gps.biocuckoo.cn.
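In eukaryotes, protein kinases take part in regulating many important biological processes by specifically phosphorylating substrate proteins, and reversibly determine cellular dynamics and plasticity. In this study, building on our existing algorithms, we designed a new version of the Group-based Prediction System (GPS), GPS 5.0. We proposed two new methods, position weight determination (PWD) and scoring matrix optimization (SMO), and trained the models with penalized logistic regression (PLR), which significantly improved the accuracy of predicting kinase-specific phosphorylation sites. Compared with other tools, GPS 5.0 achieves higher accuracy. In addition to serine/threonine and tyrosine kinases, GPS 5.0 also provides site prediction for dual-specificity kinases. The GPS 5.0 software comprises two modules: the classical module contains 617 individual predictors covering 479 human protein kinases, and the species-specific module can accurately predict kinase-specific sites for 44,795 kinases in 161 eukaryotes. The local software and online prediction tool of GPS 5.0 are available at http://gps.biocuckoo.cn.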
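A position-weighted peptide score of the general kind that position weights and a scoring matrix suggest can be sketched as follows. This is an assumption-laden illustration, not the GPS 5.0 algorithm: the 0/1 `match_score` stands in for a real amino acid substitution matrix, the weights are hand-picked rather than learned, and all names are hypothetical.

```python
def match_score(a, b):
    """Toy stand-in for a substitution matrix: 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0


def peptide_similarity(query, known, weights):
    """Sum of per-position residue scores, each scaled by its position weight."""
    if not (len(query) == len(known) == len(weights)):
        raise ValueError("peptides and weights must have the same length")
    return sum(w * match_score(a, b) for w, a, b in zip(weights, query, known))


def site_score(query, known_sites, weights):
    """Average similarity of a candidate peptide to known phosphorylation peptides."""
    return sum(peptide_similarity(query, k, weights) for k in known_sites) / len(known_sites)


# Hypothetical 5-residue windows centered on the candidate serine (index 3 here
# is given double weight purely for illustration).
weights = [1.0, 1.0, 1.0, 2.0, 1.0]
print(site_score("RRASV", ["RRASV", "RKASL"], weights))
```

In a trained system both the weights and the matrix entries would be fitted to known sites; here they are fixed by hand so that the arithmetic stays easy to follow.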

Page 72-80


Application Note

SinoDuplex: An Improved Duplex Sequencing Approach to Detect Low-frequency Variants in Plasma cfDNA Samples

Yongzhe Ren, Yang Zhang, Dandan Wang, Fengying Liu, Ying Fu, Shaohua Xiang, Li Su, Jiancheng Li, Heng Dai, Bingding Huang

Accurate detection of low-frequency mutations in plasma cell-free DNA using targeted next-generation sequencing technology has shown promising benefits in clinical settings. Duplex sequencing technology is the most commonly used approach in liquid biopsies. Unique molecular identifiers are attached to each double-stranded DNA template, followed by production of low-error consensus sequences to detect low-frequency variants. However, high sequencing costs have hindered application of this approach in clinical practice. Here, we have developed an improved duplex sequencing approach called SinoDuplex, which utilizes a pool of adapters containing pre-defined barcode sequences to generate far fewer barcode combinations than with random sequences, and implemented a novel computational analysis algorithm to generate duplex consensus sequences more precisely. SinoDuplex increased the output of duplex sequencing technology, making it more cost-effective. We evaluated our approach using reference standard samples and cell-free DNA samples from lung cancer patients. Our results showed that SinoDuplex has high sensitivity and specificity in detecting very low allele frequency mutations. The source code for SinoDuplex is freely available at https://github.com/SinOncology/sinoduplex.
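The general principle of duplex consensus calling described above can be sketched in a few lines. This is a generic illustration and not the SinoDuplex algorithm: it assumes reads have already been grouped by a hypothetical `molecule_id` (barcode pair plus mapping position), oriented to the same reference strand, and trimmed to equal length, and the thresholds are arbitrary.

```python
from collections import Counter, defaultdict


def consensus(seqs, min_frac=0.6):
    """Column-wise majority vote; emit 'N' where no base reaches min_frac."""
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_frac else "N")
    return "".join(out)


def duplex_consensus(reads, min_family_size=2):
    """Build duplex consensus sequences from (molecule_id, strand, sequence) tuples.

    A base is kept only when the single-strand consensuses of both strands agree,
    so an error confined to one strand is masked as 'N'.
    """
    families = defaultdict(lambda: {"+": [], "-": []})
    for molecule_id, strand, seq in reads:
        families[molecule_id][strand].append(seq)

    result = {}
    for molecule_id, strands in families.items():
        # Require enough reads from both strands of the original molecule.
        if any(len(strands[s]) < min_family_size for s in "+-"):
            continue
        top = consensus(strands["+"])
        bottom = consensus(strands["-"])
        result[molecule_id] = "".join(
            a if a == b and a != "N" else "N" for a, b in zip(top, bottom)
        )
    return result


toy_reads = [
    ("m1", "+", "ACGT"), ("m1", "+", "ACGT"), ("m1", "+", "ACTT"),
    ("m1", "-", "ACGT"), ("m1", "-", "ACGT"),
    ("m2", "+", "GGGG"),  # no reads from the other strand, so m2 is skipped
]
print(duplex_consensus(toy_reads))
```

Requiring agreement between the two strands is what suppresses sequencing and PCR errors; the trade-off is that molecules seen from only one strand are discarded, which is one reason the usable output of duplex sequencing matters for cost.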

Page 81-90