生物信息学中蛋白质结构分析的计算方法研究-智能与分布计算实验室

生物信息学中蛋白质结构分析的计算方法研究

姓名	李其申
论文答辩日期	2005.05.10
论文提交日期	2005.05.16
论文级别	博士
中文题名	生物信息学中蛋白质结构分析的计算方法研究
英文题名	Research on Computational Methods for Protein Structure Analysis in Bioinformatics
导师1	卢正鼎
导师2	陈传波
中文关键词	生物信息学;几何学;蛋白质结构;连续模型;多分辨率分析;结构比对;曲线匹配;动态规划
英文关键词	bioinformatics;geometry;protein structure;continuous model;multiresolution analysis;structural alignment;curve matching;dynamic programming
中文文摘	生物科学技术与计算机科学技术的迅速发展，孕育了一门新的学科??生物信息学。特别是人类基因组计划（Human Genome Project, HGP）的顺利实施和完成，极大地推动了这门新生学科的迅速发展。作为破译“生命天书”的新兴学科，生物信息学引起了人们广泛的研究兴趣。生物信息学的根本任务是从生物数据中发现知识，揭示生命的秘密与本质。作为生命物质基础的蛋白质，是各种生命活动的主要承担者，自然有着极为重要的研究意义和价值。蛋白质空间结构因为与其序列和功能的密切关系，在生物信息学中有着重要的研究地位。由于在空间的扭曲、缠绕，蛋白质的空间结构变得异常复杂，使人们一时难以解开其“折叠密码”。现有的蛋白质结构模型，如距离矩阵、格点模型、连接图等，都属“离散型”模型。根据蛋白质分子的链式结构特点，运用B样条空间曲线理论，提出了一种新的蛋白质结构分析模型??B样条结构曲线，即BSSC（B-Spline Structure Curve）模型。它是一种“连续型”模型，其重要意义在于，可以使蛋白质结构相关问题的研究，运用连续数学的理论和方法来进行，从而克服了“离散型”模型的局限性。同时，BSSC模型具有连续性、可控性和可扩展性等特点。利用BSSC模型的这些特点，可将其扩展运用到诸如分子进化、蛋白质折叠等多个生物信息学问题的研究。比较是科学研究的根本方法之一。对蛋白质的结构进行比较（在生物学意义上称之为比对）是预测其功能、分析其进化关系等问题的重要方法和途径。对应结构的空间叠合是蛋白质结构比较的基础。运用矩阵理论和奇异值分解方法，给出了一个加权对应结构的最优刚体叠合定理。该定理不仅给出了最优刚体叠合空间变换的旋转矩阵R和平移向量t，而且给出了直接计算空间叠合最小均方差RMSD（Root Mean-Square Deviation）的表达式。蛋白质结构比较（对）是生物信息学中的一个重要问题，其目的是通过数据库查询从蛋白质结构数据库中查找结构同源或结构类似的蛋白质分子。蛋白质结构比对是一个NP难问题。针对这个问题，人们已经提出了不少解决方法。但由于问题本身的复杂性，目前还没有一种方法被普遍认可。利用蛋白质结构BSSC模型，基于曲线匹配（Curve Matching）的思想，并结合动态规划（Dynamic Programming）算法，提出了蛋白质结构比对的CMDP（Curve Matching and Dynamic Programming）算法。该方法将曲率和挠率两个几何不变量作为特征描述对，把三维的结构曲线转换为一维的数值对字符串，从而简化了问题的难度。在分析曲线匹配问题中，提出了形状匹配二值打分矩阵??BSM（Binary Scoring Matrix of Shape Matching）的概念。利用这个打分矩阵，可以方便地提取两条结构曲线的最大形状匹配片段。在最大形状匹配意义下，将两个被比较蛋白质结构进行空间叠合，然后利用序列比对中常用的N-W动态规划法来提取两叠合结构的对应残基。N-W动态规划法具有允许比对结果序列中存在插入和删除，并能找出全局最优排列的特点。在运用N-W动态规划法提取两叠合结构对应残基的方法中，提出了另外一个打分矩阵??距离打分矩阵SMD（Scoring Matrix of Distance）。这个矩阵描述的是在给定的距离阀值意义下两叠合结构上残基对间的距离远近。为优化调整结构比对的结果，引入了数值计算中常用的迭代优化机制。实验表明，CMDP算法在计算时间和比对结果上，均得到了令人满意的结果。在观察和分析蛋白质结构时，其异常的空间复杂性增加了问题的难度。为此，基于蛋白质结构的BSSC模型，利用空间曲线的小波分解方法，提出了蛋白质结构的多分辨率层次描述方法，从而提供了一种在不同分辨率层次上对蛋白质结构进行观察和分析的思路和工具。多分辨率分析思想为蛋白质结构相关问题的研究，特别是蛋白质分子体系的进化研究开辟了一条新途径。值得指出的是，在方法论上为蛋白质结构相关问题的研究提供了两种新的研究思路：建立连续数学分析模型和运用多分辨率分析思想。生物信息学作为一门新生学科，其理论基础和研究方法还很不成熟。运用各种理论和方法，从不同的角度对其进行分析和研究，是一项既有重大意义又有挑战性的工作。
英文文摘	Bioinformatics was gestated to be a new subject with the rapid development of science and technology of biology and computer. Especially the enforcement and accomplishment of the Human Genome Project (HGP), has greatly promoted the development of bioinformatics. As a new subject of uncovering the secret of life, bioinformatics has aroused many people’s interests in scientific study. The underlying task of bioinformatics is to obtain knowledge from the biological data and to reveal the essence of life. Proteins are the material bases of life and the main undertakers of various life activities, which have great values and significances to be studied. Due to the close correlations with sequence and function, the spatial structure of the protein is very important in the research of bioinformatics. However, it is difficult for people to decode the folding code of protein because of its extraordinary complexity, which is derived from its distortion and twist in space. The existing models for protein structure are all discrete models, such as distance matrix, lattice, contact map, and so on. In this paper, a new model of B-Spline Structure Curve (BSSC) is proposed for protein structure based on its chain trait and the theory of B-spline spatial curve. BSSC is a continuous model, whose special significance is that the theories and methods of continuous mathematics can be adopted in the study of protein structure and the limitations of discrete models are overcome. Furthermore, BSSC has the characteristics of continuity, controllability and extendibility, which can be used in the research fields of molecular evolution, protein folding, and so on. Comparison is one of the essential methods in scientific research. To compare the structures is an important approach to predict the functions and to analyze the evolutionary relationships for proteins. The precondition of protein structure comparison (alignment) is the spatial superimposition of the corresponding structures. An optimum theorem is presented for the rigid superimposition on the base of matrix theory and singular value decomposition, which not only decides the rotation matrix R and the translation vector t of the spatial transformation, but also produces the expression of the minimum Root Mean- Square Deviation (RMSD) directly. The Protein Structure Alignment (PSA) is a NP-hard problem. Although many solutions have been proposed to this problem in the past years, no method is generally accepted up to now because of its complexity. On the basis of BSSC, a new approach named CMDP is presented for PSA by Curve Matching and Dynamic Programming. This method takes the curvature and the torsion of each referential position on BSSC as the shape signature-twins to transform the three-dimensional structure curve to a one- dimensional twins-string. Thus, the PSA problem is simplified. In order to solve the problem of shape matching, a Binary Scoring Matrix of Shape Matching (BSM) is proposed to conveniently detect the largest fragment of shape matching in two BSSCs. According to the maximal matching fragments detected, two proteins compared are superimposed and the equivalent residues are extracted by the algorithm of N-W dynamic programming, which permits insertions and deletions in the alignment and obtains the global optimum alignment simultaneously. Another scoring matrix ? SMD, Scoring Matrix of Distance, is proposed to describe the distances of residue-pairs of the superimposed structures under the given threshold. In order to optimize the alignment, the mechanism of iterative optimization is employed in the CMDP. The spatial complexity of proteins brings much trouble to observe and analyze the structures. Based on the BSSC and the wavelet decomposition of spatial curve, a multiresolution representation of protein structure is presented, which is a novel idea and method to observe and analyze the protein structures from multiresolution perspectives. It is especially useful to study the evolution relationships of the protein universe. It is worth to point out that the dissertation offers two novel ideas for the study of protein structure: one is to set up the continuous analysis model and the other is to apply the multiresolution analysis. As a new subject, the theory and method of bioinformatics is not well-founded. It is significant and challenging to use various theories and methods to study bioinformatics from different perspectives.