Intelligent and Distributed Computing Laboratory
  Query Processing and Optimization in Heterogeneous Information Integration
Name Li Ruixuan (李瑞轩)
Defense date 2004.11.05
Submission date 2004.11.17
Thesis level Doctoral
Chinese title 异构信息集成中的查询处理与优化研究
English title Query Processing and Optimization in Heterogeneous Information Integration
Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords 异构信息集成系统;集成数据模型;模式映射;查询处理;查询分解;查询调度;查询优化;多自治域
English keywords Heterogeneous information integration system;Integration data model;Schema mapping;Query processing;Query decomposition;Query scheduling;Query optimization;Multiple autonomous domains
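Chinese abstract (English translation) In recent years, computer and network technologies have developed at an accelerating pace, yet data, the core of every application, is still stored in different forms in different systems: scattered rather than gathered, and gathered without being unified, so that it remains distributed and heterogeneous. As application requirements keep growing, more and more users want transparent access to useful data from these massive information sources, together with interoperation among multiple hardware and software systems and among different information sources. However, these sources may be physically distributed across multiple autonomous domains in a heterogeneous environment, with different data formats, storage methods, and access control policies, and may differ greatly at the logical level in data model, manipulation language, and data semantics. In addition, whether a source can be shared, how it is shared, and what it shares may change at any time. Designing a heterogeneous information integration system (HIIS) that supports a common data model and a uniform query language is a good way to achieve this interoperation. An HIIS can hide the differing access methods and user interfaces of existing heterogeneous data management systems, present users with a common interface to many heterogeneous data sources, and provide an information exchange platform that processes multiple data sources together and consolidates the results of multiple queries. Data interoperability is the main problem to be solved in heterogeneous information integration. Federated database systems and multidatabase systems are two approaches to integrating and interoperating multiple data sources in a distributed heterogeneous environment, each with its own strengths and weaknesses. Based on an analysis of the differences between the two, the dissertation proposes a multi-domain-based hierarchy interoperation (MDHI) model. This framework delivers efficient information integration and processing within a local area while also providing a way to integrate heterogeneous data sources over a wide area, which better matches the needs of current applications. Based on an analysis of the basic schema architecture of an HIIS, the dissertation proposes an XML-based integration data model (XIDM) as the common data model of the integration system. XIDM describes the data models of the global schema and the export schemas as graph structures and can integrate data from heterogeneous systems including database systems, file systems, and Web information systems. To connect the schema levels of the integration system, the dissertation gives global mappings from the global schema to the export schemas and local mappings from the export schemas to the local schemas, which address the mapping between XIDM and the relational data model, the object-oriented model, and the HTML/XML document model. Examples show that XIDM and its schema mapping method are sound and effective. Query processing is one of the key technologies of an HIIS, and query decomposition, query scheduling, and query optimization are its core. After defining the basic concepts of query processing in the integration system and analyzing the basic characteristics and requirements of XML queries, the dissertation selects XQuery as the query language for XIDM and gives a basic architecture for query processing. On this basis, it gives the basic principles of global query decomposition and a query decomposition algorithm, and analyzes the semantic equivalence of that algorithm. Query post-processing is the process of scheduling according to the query plan and assembling intermediate results through post-processing operations, which consist mainly of all the inter-site operations involved in the global query. By extending the relational operations of relational algebra, the dissertation defines path-based element cluster operations for XIDM, namely the XRA algebra, to express the merging of sub-query results in query post-processing. It gives transformation rules for query post-processing, proposes a join tree structure to express the post-processing operations of the integration system, and normalizes that structure. By introducing the concept of a join graph, the normalized join tree is converted into an equivalent join graph for use in post-processing scheduling; on this basis, a multi-level concurrent scheduling algorithm based on the join graph is given to make query post-processing as concurrent as possible. Query optimization is a very important and very complex problem in an HIIS. For the cost of inter-site operations in query post-processing, the dissertation analyzes the cost parameters that affect post-processing optimization and gives methods for estimating local data source cost and communication cost. Joins are often the most expensive operations in query processing. Based on a join graph composed of inter-site joins and outer joins, the dissertation gives a minimum-spanning-tree-based static optimization algorithm (MST-SO), a statistical-reasoning-based dynamic optimization method (SR-DO), and a hybrid strategy that combines the two; their optimization performance is analyzed and compared through simulation experiments, which show that hybrid optimization performs better. Building on these theoretical and experimental results, Panorama Web One, a Web-services-based multi-autonomous-domain heterogeneous information integration system, was designed and developed. It provides transparent access to database systems such as Oracle, Sybase, and DB2 and to other file-based data sources such as HTML/XML documents; its main functions cover schema integration and schema information management, query processing, and query optimization. A performance analysis and evaluation of Panorama Web One is given through comparative tests against the original system.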
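The abstract does not spell out the multi-level concurrent scheduling algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions: post-processing operations are grouped into levels so that every operation in a level depends only on results from earlier levels and could therefore run concurrently. The function name `schedule_levels`, the operation names (`q1`, `j12`, and so on), and the dependency-dictionary representation are hypothetical and are not taken from the dissertation.

```python
def schedule_levels(deps):
    """Group operations into levels that could run concurrently.

    deps maps each operation name to the set of operations whose
    results it needs (an assumed, simplified view of a join graph).
    """
    remaining = {op: set(needed) for op, needed in deps.items()}
    levels = []
    while remaining:
        # Operations with no unmet dependencies are ready now.
        ready = sorted(op for op, needed in remaining.items() if not needed)
        if not ready:
            raise ValueError("cyclic dependency between operations")
        levels.append(ready)
        for op in ready:
            del remaining[op]
        # Mark the finished operations as satisfied for everyone else.
        for needed in remaining.values():
            needed.difference_update(ready)
    return levels


# Hypothetical plan: three sub-queries, then two inter-site joins.
deps = {
    "q1": set(),
    "q2": set(),
    "q3": set(),
    "j12": {"q1", "q2"},    # join of the q1 and q2 results
    "j123": {"j12", "q3"},  # join of j12 with the q3 result
}
levels = schedule_levels(deps)
print(levels)
```

In this toy plan the three sub-queries form the first level, and each join forms a later level once its inputs are available.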
English abstract Computer and network technologies have developed at an accelerating pace in recent years. Yet data, the core of every application, is still stored in different systems in different forms and remains isolated in distributed, heterogeneous environments. As application requirements grow, more and more users want to access and manipulate useful information across multiple massive information sources and to achieve interoperability among computer systems and information sources. However, these data sources may be physically located in multiple autonomous domains of a heterogeneous environment, with different data formats, storage modes, and access control policies, and may also differ logically in data model, manipulation language, and data semantics. Moreover, the shareability, sharing modes, and shared contents of the sources may change at any time. Designing a heterogeneous information integration system (HIIS) that supports a common data model and a uniform query language is therefore a good way to implement this kind of interoperation. An HIIS can hide most of the differences in access methods and user interfaces among heterogeneous data management systems. It also provides an information interoperation platform that offers a common interface to multiple heterogeneous data sources and combines the intermediate query results from these sources. Data interoperability is one of the main problems in heterogeneous information integration. Federated database systems and multidatabase systems are two approaches to the integration and interoperation of multiple data sources in a distributed, heterogeneous environment, and each has advantages and disadvantages. The dissertation presents a multi-domain-based hierarchy interoperation (MDHI) model that merges these two approaches. 
The framework based on the MDHI model not only meets the efficiency requirements of information integration and processing within a local area, but also provides a method for integrating multiple heterogeneous data sources over a wide area, which better matches real-world application requirements. Because the local schemas of the local data sources in an HIIS differ, the dissertation presents an XML-based integration data model (XIDM) as the common data model for integrating them. XIDM describes the export and global schemas as graph structures and can integrate data from multiple heterogeneous systems, such as database systems, file systems, and Web information systems. The global mappings between the global schema and the export schemas, and the local mappings between the export schemas and the local schemas, are also given. These mappings handle the transformation between XIDM and the relational data model, the object-oriented model, and the HTML/XML document model, in both directions. Examples demonstrate that XIDM and the schema mapping approach are sound and effective. Query processing is one of the key techniques in an HIIS, and query decomposition, scheduling, and optimization are its central problems. The dissertation first defines the basic concepts of query processing and gives an architecture for it in an HIIS. After analyzing the characteristics and requirements of XML queries, it chooses XQuery as the query language for XIDM. On this basis, the basic principles and an algorithm for global query decomposition are given, and the semantic equivalence of the algorithm is discussed. Post-query processing is the process of scheduling the query execution plan and combining the intermediate results through post-processing operations. These operations consist of the inter-site operations involved in the global queries. 
The dissertation extends the operations of relational algebra to define XIDM-oriented, path-based operations on element clusters, called XIDM relational algebra (XRA), which represent the combination of sub-query results in post-query processing. Transformation rules and a join tree structure are given to express the post-query operations of the integration system, and a method for normalizing the join tree into a join normal tree (JNT) is presented. The concept of a join graph is introduced for post-query scheduling, and the JNT is converted into an equivalent join graph. On this basis, a multi-level parallel scheduling algorithm for post-query processing based on the join graph is presented to increase the concurrency of query execution. Query optimization is both very important and very complicated in an HIIS. After analyzing the cost parameters of inter-site operations, the dissertation gives methods for estimating the costs of local data sources and of inter-site communication. Join operations are generally the most expensive operations in query processing. Based on a join graph composed of inter-site joins and outer joins, the dissertation gives a minimum-spanning-tree-based static optimization (MST-SO) algorithm, a statistical-reasoning-based dynamic optimization (SR-DO) method, and a hybrid optimization strategy that integrates the two. Simulations and experimental results show that the hybrid optimization performs better. These theoretical principles and practical techniques are adopted in developing a Web-services-based multi-domain heterogeneous information integration system, Panorama Web One, whose functions cover schema integration, schema information management, query processing, and query optimization. It provides transparent access to multiple heterogeneous data sources, such as Oracle, Sybase, DB2, HTML/XML documents, and other data sources. 
The results of a performance analysis and evaluation of Panorama Web One, compared with the earlier system, Panorama, are also reported.
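The abstract names a minimum-spanning-tree-based static optimization (MST-SO) but gives no details of it. As a rough illustration of the underlying idea only, the sketch below runs the standard Kruskal algorithm over a small join graph whose nodes are sites and whose edge weights are assumed to be estimated inter-site join costs. The function name `minimum_spanning_tree`, the site numbering, and the cost values are hypothetical; this is not the dissertation's algorithm.

```python
def minimum_spanning_tree(num_sites, edges):
    """Kruskal's algorithm over a join graph.

    edges is a list of (estimated_cost, site_a, site_b) tuples;
    sites are numbered 0 .. num_sites - 1.
    """
    parent = list(range(num_sites))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for cost, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the cheapest join that connects two separate groups.
            parent[root_a] = root_b
            tree.append((cost, a, b))
    return tree


# Hypothetical join graph over four sites with made-up cost estimates.
edges = [(4.0, 0, 1), (1.5, 1, 2), (3.0, 0, 2), (2.0, 2, 3)]
mst = minimum_spanning_tree(4, edges)
total = sum(cost for cost, _, _ in mst)
print(mst, total)
```

The selected edges connect all four sites using the cheapest available joins; a static optimizer in this spirit could use such a tree as the starting point for a join order, before any runtime statistics are available.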