Research on Neural Network-Based Promoter Recognition in the Human Genome - Intelligent and Distributed Computing Laboratory

Research on Neural Network-Based Promoter Recognition in the Human Genome

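Name	李滔 (Li Tao)
Thesis defense date	2005.05.10
Thesis submission date	2005.05.19
Thesis level	Doctoral
Chinese title (translated)	Research on Neural Network-Based Promoter Recognition in the Human Genome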
English title	Research on Neural Network-Based Promoter Prediction and Recognition in Human Genome
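Advisor 1	卢正鼎 (Lu Zhengding)
Advisor 2	陈传波 (Chen Chuanbo)
Chinese keywords (translated)	Promoter prediction; human genome; hybrid neural networks; eukaryotes; pentamers; impact factor; transcriptional regulation; backpropagation algorithm
English keywords	Promoter prediction;Human genome;Hybrid neural networks;Eukaryotic organisms;Pentamers;Impact factor;Transcriptional regulation;Back propagation algorithm
Total pages	92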
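Chinese abstract (translated)	The publication and preliminary analysis of the complete human genome sequence is an important milestone in the progress of molecular biology. Predicting genes in the genome and annotating their functions has become a hot topic at the frontier of molecular biology research. Over the past few years, NCBI (National Center for Biotechnology Information), Ensembl, and Golden Path have produced preliminary gene annotations, but detailed annotation will take many more years to complete. Current gene annotation covers only protein-coding regions, while analysis of transcriptional regulatory regions has progressed slowly, especially in eukaryotes. The reason is that transcriptional regulatory regions in eukaryotes make up only a very small fraction of the whole genome. Gaining a comprehensive understanding of transcription-level regulation of the genome is therefore challenging. The promoter is an important element in the regulation of gene expression: it controls when, where, and how a gene is expressed, thereby realizing a predetermined, orderly, and irreversible process of differentiation and development. The promoter determines the critical first step of gene expression, the transcription of mRNA. In-depth study of promoter structure and function is therefore very important for understanding gene expression patterns, gene regulatory networks, and cell specificity. Given this biological importance, how to identify promoters in genes quickly and accurately, and to discover the information they contain, has become a very important topic in the post-genomic era.
Many prediction methods have appeared for eukaryotic promoter recognition, but their false positive (FP) rate is quite high, about 1 FP per kilobase; for the gigabase-scale human genome, such predictions are clearly meaningless. The PromoterInspector algorithm proposed by Scherf in 2000 was the first breakthrough: in predictions over 34 megabases of human chromosome 22, it reached a sensitivity of 43% and a specificity of 43%. Since then, research groups in many countries have worked to improve promoter prediction accuracy from various angles.
The most critical problem in computational promoter recognition is building a biologically meaningful functional model of the promoter. Because regulation proceeds through multiple pathways, multiple transcription factors act in coordination, and their structures are complex, analyzing promoter regions computationally is very difficult. To recognize human genome promoters correctly, the complex regulatory mechanism must be abstracted and simplified on the basis of prior biological knowledge. By analyzing transcription start sites (TSS), the pentamer distribution characteristics of regulatory, coding, and non-coding regions, and CpG island information, a biologically meaningful functional model of the promoter was established. This model extracts not only the features of core promoter elements such as the TATA box, GC box, and CAAT box, but also subtle elements and structural patterns related to promoter function.
Another key problem in computational promoter recognition is the selection and pruning of promoter features. For the feature vector that forms the input layer of the neural network, too many nodes increase resource usage and slow execution, and may also degrade network performance and reduce generalization ability. There is therefore an optimal number of input-layer nodes. An impact factor, Imp(X), is proposed to evaluate the relative importance of promoter features; the features are ranked by impact factor, and the relatively important pentamer distribution features are selected as the neural network input.
PromPredictor, the proposed hybrid-neural-network algorithm for human genome promoter recognition, is built on the research described above. It comprises a new promoter recognition model, coding theory, feature selection and pruning, and machine learning algorithms. To evaluate PromPredictor's ability to recognize human genome promoters, three human chromosomes with different G+C content were used as the test set: chromosomes 4, 21, and 22, with G+C contents of 38%, 41%, and 48%, respectively. PromPredictor was compared against several effective promoter recognition algorithms published recently. The results show that PromPredictor's combined performance on the three chromosomes is a sensitivity of 64.47% and a specificity of 82.2%. Compared with the other algorithms, PromPredictor has higher sensitivity and specificity. The PromPredictor program code has been made available on the Internet.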
English abstract	The publication and preliminary analysis of the human genome sequence mark a significant milestone in molecular biology. Gene prediction and functional annotation in the human genome have become an active research frontier. In the past few years, much effort has been devoted to gene annotation. The National Center for Biotechnology Information (NCBI), Ensembl, and Golden Path, for instance, have provided initial annotations, but the full annotation process is expected to continue for many years, and current gene annotations cover only protein-coding regions. Relatively few tools have been developed to identify the regulatory regions required for correct transcriptional activity of the genome. This task is particularly difficult in eukaryotic organisms, in which regulatory regions represent a small percentage of the genome and are overwhelmed by presumably non-functional DNA. Understanding transcriptional regulation in the genome therefore remains a challenging problem. The promoter is an important element in the regulation of gene expression. It controls when, where, and how a gene is expressed, and thereby realizes a predetermined, orderly, and irreversible process of differentiation and development. It determines the first step of gene expression: the transcription of mRNA. Further research on promoter structure and function is therefore important for understanding gene expression patterns, gene regulatory networks, cell specificity, and related topics. Given the important biological function of the promoter, how to recognize promoter regions rapidly and accurately, and to discover the information they contain, has become a key problem. Knowledge of promoters may be useful in elucidating the regulation and expression mechanisms of genes, and may shed light on the function of novel and uncharacterized genes. Many prediction methods have appeared for eukaryotic promoters, but their false positive rate is quite high, roughly estimated at one per kilobase.
Put another way, the ratio of true predictions to false predictions is very small. For the gigabase-scale human genome, such predictions are clearly meaningless. In 2000, the PromoterInspector algorithm proposed by Scherf achieved the first breakthrough, reaching 43% sensitivity and 43% specificity on 34 megabases of human chromosome 22. Since then, research groups in many countries have worked to improve the accuracy of promoter prediction from various angles. The key problem in computational promoter recognition is establishing a biologically meaningful functional model of the promoter. Because regulation proceeds through multiple pathways, several transcription factors act in coordination, and their structures are complex, it is difficult to analyze promoter regions computationally. To recognize promoters in the human genome, it is necessary to abstract and simplify the complicated regulatory mechanism on the basis of prior biological knowledge. A biologically meaningful functional model of the promoter is established by combining information about transcription start sites (TSS), pentamer distributions in regulatory, coding, and non-coding regions, and CpG islands. This model extracts not only core promoter elements, such as TATA boxes, GC boxes, and CAAT boxes, but also subtle elements and structural patterns related to promoter function. Another important problem in computational promoter recognition is feature selection and dimensionality reduction. When the feature vector serves as the input layer of a neural network, too many nodes consume more computing resources and slow computation. In addition, unnecessary nodes can degrade network performance and reduce generalization ability. Hence, there is an optimal number of input nodes. Here, we propose an impact factor, Imp(X), to evaluate the relative importance of the features. According to the impact factor, we select the relatively important pentamer distribution features as the input of the neural network.
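The feature selection step described above can be sketched as follows. This is a minimal illustration, not the thesis's method: the abstract does not define Imp(X), so the score used here (the absolute difference in mean pentamer frequency between promoter and background sequences) is a stand-in assumption, and the function names and toy sequences are hypothetical.

```python
from itertools import product

# All 4^5 = 1024 possible pentamers over the DNA alphabet, in a fixed order.
PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]
INDEX = {p: i for i, p in enumerate(PENTAMERS)}


def pentamer_frequencies(seq):
    """Return a 1024-long list of overlapping pentamer frequencies for seq."""
    counts = [0] * len(PENTAMERS)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - 4):
        idx = INDEX.get(seq[i:i + 5])  # windows with non-ACGT symbols are skipped
        if idx is not None:
            counts[idx] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * len(counts)


def rank_pentamers(promoters, background, top_k):
    """Rank pentamers by a stand-in importance score (NOT the thesis's Imp(X)).

    Assumed score: |mean frequency in promoters - mean frequency in background|.
    """
    def mean_freqs(seqs):
        rows = [pentamer_frequencies(s) for s in seqs]
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos, neg = mean_freqs(promoters), mean_freqs(background)
    scores = [abs(p - n) for p, n in zip(pos, neg)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(PENTAMERS[i], scores[i]) for i in order[:top_k]]


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    promoters = ["GCGCGCGCGCTATAAAGGCGCG", "CCGCGCGGCGCGTATAAAGCGC"]
    background = ["ATTTATTTTAAATTTTATATTT", "TTTTAAATATTTAATTTTTAAT"]
    for pentamer, score in rank_pentamers(promoters, background, top_k=5):
        print(pentamer, round(score, 3))
```

Keeping only the top-ranked pentamers shrinks the input layer from 1024 nodes to `top_k`, which is the trade-off between resource use and generalization that the abstract describes.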
In this paper, we present a novel hybrid machine learning system, called PromPredictor, for recognizing promoter regions in the human genome. PromPredictor is based on the research described above. It combines a new promoter recognition model, coding theory, and feature selection and dimensionality reduction with machine learning algorithms. To evaluate the prediction ability of PromPredictor on the human genome, we chose three chromosomes with different G+C content: human chromosomes 4, 21, and 22, with G+C contents of 38%, 41%, and 48%, respectively. Several recently developed, effective promoter prediction algorithms were chosen for comparison with PromPredictor. The evaluation showed that, over the three chromosomes combined, PromPredictor achieved 64.47% sensitivity and 82.20% specificity. Comparison with the other systems showed that our system had higher sensitivity and specificity in predicting promoter regions. We have posted our program code on the Internet, where PromPredictor can be downloaded freely.
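The sensitivity and specificity figures above can be illustrated with a short sketch. The abstract does not state how the two measures are defined; the code assumes the convention common in promoter prediction work, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), which is also called positive predictive value. The function names and the per-chromosome counts are hypothetical and are not results from the thesis.

```python
def sensitivity(tp, fn):
    """Fraction of annotated promoters that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity_ppv(tp, fp):
    """Fraction of predictions that hit an annotated promoter: TP / (TP + FP).

    Assumption: this is the sense of "specificity" used in the abstract.
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def pooled_metrics(per_chromosome):
    """Pool (tp, fp, fn) counts over chromosomes, then compute both metrics."""
    tp = sum(c[0] for c in per_chromosome.values())
    fp = sum(c[1] for c in per_chromosome.values())
    fn = sum(c[2] for c in per_chromosome.values())
    return {
        "sensitivity": sensitivity(tp, fn),
        "specificity_ppv": specificity_ppv(tp, fp),
    }


if __name__ == "__main__":
    # Invented counts as (tp, fp, fn), for illustration only.
    counts = {"chr4": (50, 10, 30), "chr21": (20, 5, 10), "chr22": (30, 5, 10)}
    for name, value in pooled_metrics(counts).items():
        print(f"{name}: {value:.2%}")
```

Pooling the counts before dividing weights each chromosome by its number of promoters; averaging per-chromosome percentages would be a different choice, and the abstract does not say which one produced the combined figures.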