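Intelligent and Distributed Computing Laboratory (智能与分布计算实验室)
  Research on Automatic Gene Structure Annotation System for Eukaryotic Genomes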
Name 陆枫 (Lu Feng)
Thesis defense date 2006.11.26
Thesis submission date 2006.11.24
Thesis level Doctoral
Chinese title 真核生物基因组结构自动注释系统研究
English title Research on Automatic Gene Structure Annotation System for Eukaryotic Genomes
Advisor 1 卢正鼎 (Lu Zhengding)
Advisor 2 周艳红 (Zhou Yanhong)
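Chinese keywords (translated) eukaryote; DNA sequence; gene structure prediction; genome structure annotation; genome database; genome browser; bioinformatics grid; cluster computing
English keywords eukaryote; DNA sequence; prediction of gene structure; gene structure annotation for genomes; genome database; genome browser; bioinformatics grid; cluster computing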
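Chinese abstract (translated) Since whole-genome sequencing became possible, genome structure annotation (understanding the gene composition, gene structures and regulatory elements in genomic DNA) has become an important problem in bioinformatics research. This calls for an automatic genome structure annotation system that can quickly and conveniently obtain information on the genes in a genome and their structural elements. At the same time, suitable techniques are needed to store and manage the annotation data, to make the data accessible and usable by web users worldwide, and, where necessary, to provide data security. In addition, because genome structure annotation is computationally very demanding, the annotation computation needs to run on high-performance computing resources. To address these problems, the following work was carried out: An overall workflow was defined for automatically annotating gene structures in a genome by integrating different data (such as protein sequences, cDNA/mRNA sequences, EST sequences and whole-genome sequences) and different gene structure prediction methods (such as protein sequence alignment, EST sequence analysis and ab initio prediction), and the overall framework of an automatic genome structure annotation system was established. Eukaryotic gene structure prediction was studied in terms of extracting gene structure regularities, mining and using EST data, designing models and algorithms, and developing software, resulting in software for ab initio gene structure prediction and for EST-based identification of gene exon regions. A genome structure annotation database was built. A conceptual database model centered on genome structure annotation elements was established, which can effectively store and manage genome structure annotation data. Based on the database's build-once, access-many-times characteristic, data access efficiency was improved by allowing redundancy, allowing the attributes of relational tables to change, and subdividing entities. Measures such as building indexes, storing data clustered by coordinate, pre-sorting data, partitioning data and storing sequences as binary data files provide effective support for web retrieval, visual browsing and computational access. A database code generator was used to reduce the cost of database development. A visual genome browser was developed to provide visual browsing access to genome structure annotation data. Annotation elements are displayed visually as "tracks", the approach shared by the three best-known international genome browsers; to address its shortcomings, improvements were proposed, including organizing related data around annotation elements, aggregating data of the same type and level, and providing convenient interactive operations based on SVG. Track roaming with an adaptive step size and track zooming with an adaptive resolution were used to refine the chromosome-centered navigation strategy. An automatic genome structure annotation system was built that integrates the above annotation software, database and web access interface in a high-performance computing environment. A two-level scheduling architecture based on grid computing and cluster computing was used to deploy the system in the high-performance computing environment.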
English abstract Because the genome sequences of species represent the first closed data set in biology, gene structure annotation for genomes, which includes predicting the gene composition, gene structures and gene regulatory elements in genomic DNA sequences, has become a core issue in bioinformatics. An automatic genome annotation system based on bioinformatics analysis methods offers a rapid and effective way to annotate the features of a genome, including genes and gene structures. At the same time, scalable methods and technologies are needed to store and manage genome-scale annotation data, so that users worldwide can access and retrieve the data over the web, with information security and data protection provided where necessary. Moreover, because the annotation system relies on a set of analysis programs with a very large demand for computing power, it must run in a high-performance computing environment. To address these problems, the following work has been carried out. A gene-building pipeline was designed that enables fast automated annotation of eukaryotic genomes; it draws on evidence from known protein, cDNA/mRNA, EST and whole-genome sequences and integrates several analysis methods, including protein alignment, EST-based gene building and ab initio prediction. On this basis, an automatic gene annotation system was set up. Eukaryotic gene structure prediction was studied from several angles: extracting the features of eukaryotic gene structures, mining EST data, designing models and algorithms, and developing software. As a result, a program for ab initio prediction of eukaryotic gene structure and a program for identifying true EST alignments and gene exon regions were developed. A genome database integrating genome sequence data and annotations was established. Its conceptual model is centered on the annotated features of the genome and supports effective storage and management of the annotation results.
Exploiting the database's “build once, access many times” characteristic, data access efficiency was improved by allowing redundancy, allowing the attributes of relational tables to vary, and subdividing entities. Through a series of measures, including building indexes, clustering data by coordinate, pre-sorting data, partitioning data and storing sequences in binary flat files, the database was optimized to support fast interactive performance with the web tools that provide visualization and query capabilities for mining the data. Furthermore, a code generator was developed to reduce the cost of the genome database project. A genome browser, a web tool for visually displaying and accessing any requested portion of the genome annotation, is provided. The annotation data are displayed as a series of aligned annotation ‘tracks’, the approach also adopted by the three best-known international genome browsers. Several improvements to that approach are proposed: organizing data around the annotation features, aggregating data of the same type and level, and providing SVG-based interactive browsing operations. Track roaming with a self-adaptive step size and track zooming with a self-adaptive scale were adopted to improve navigation through the genomic data. Finally, an automatic gene structure annotation system for eukaryotic genomes was built in a high-performance computing environment; it comprises a set of heuristic annotation programs, the genome database and a web site for genome display. A computational framework is also presented, characterized by a two-level job scheduling system based on grid and cluster computing, which carries out the large-scale computing tasks on high-performance computing resources.
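
As an illustration only, the following Python sketch shows how pre-sorting annotation features by coordinate can support fast retrieval of the features inside a browser window, and how a roaming step can adapt to the current window width. It is not the thesis implementation: the names `Feature`, `Track`, `query` and `pan`, the 1-based inclusive coordinates, and the default step of half a window are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical annotation feature with 1-based inclusive coordinates."""
    name: str
    start: int
    end: int


class Track:
    """Hypothetical track holding features pre-sorted by start coordinate."""

    def __init__(self, features):
        # Sort once at build time ("build once, access many times").
        self.features = sorted(features, key=lambda f: f.start)
        self.starts = [f.start for f in self.features]
        # Longest feature length bounds how far left an overlap can begin.
        self.max_len = max((f.end - f.start + 1 for f in self.features), default=0)

    def query(self, win_start, win_end):
        """Return the features overlapping the window [win_start, win_end]."""
        lo = bisect_left(self.starts, win_start - self.max_len + 1)
        hi = bisect_right(self.starts, win_end)
        return [f for f in self.features[lo:hi] if f.end >= win_start]


def pan(win_start, win_end, direction, chrom_len, fraction=0.5):
    """Shift the window left (-1) or right (+1) by a step scaled to its width.

    Assumes the window is no wider than the chromosome.
    """
    width = win_end - win_start + 1
    step = max(1, int(width * fraction))  # step adapts to the zoom level
    new_start = win_start + direction * step
    new_start = max(1, min(new_start, chrom_len - width + 1))  # clamp to chromosome
    return new_start, new_start + width - 1


track = Track([
    Feature("geneC", 2000, 2500),
    Feature("geneA", 100, 500),
    Feature("geneB", 450, 900),
])
visible = [f.name for f in track.query(480, 1000)]
next_window = pan(1, 1000, +1, chrom_len=10000)
```

Here `visible` holds the names of the features that intersect the window 480 to 1000, and `next_window` is the window after one step to the right. Bounding the search by the longest feature keeps the binary search correct when features overlap; an interval tree would be an alternative if feature lengths vary widely.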