异构信息集成环境中基于语义的查询研究-智能与分布计算实验室

异构信息集成环境中基于语义的查询研究

姓名	王小刚
论文答辩日期	2006.05.09
论文提交日期	2006.05.14
论文级别	博士
中文题名	异构信息集成环境中基于语义的查询研究
英文题名	Semantic-based Query in Heterogeneous Information Integratiuon Environment
导师1	卢正鼎
导师2
中文关键词	异构信息集成;语义模型;本体;查询处理调度;查询优化
英文关键词	Heterogeneous information integration, Semantic model, Ontology, Query processing scheduling, Query optimization
中文文摘	随着计算机技术的发展和数据库的广泛应用，基于多种异构数据源的互操作需求越来越多。由此上世纪90年代人们提出了多数据库等结构化信息集成框架，然而近十年来，网络中可供利用的信息总量以惊人的速率增长，另一方面应用需求的变化表现得更加频繁，面对海量的信息和不断变化的应用需求，结构化集成方案表现出一定的局限性。主要表现在对信息语义无法实现机器处理，集成过程复杂、自动化程度低，无法适应新一代应用的需求。传统的多数据库信息集成方案需要手工建立从局部信息空间到全局信息空间各级模式间的映射，互操作的共享语义也没有形式化，部分语义依然以代码逻辑的形式呈现，从系统论的角度来说集成系统未来可能需要集成为更大的超系统，这些代码逻辑依然会成为集成的障碍。针对这些问题提出了一种基于本体映射的语义异构信息源集成框架(HISIM)，该集成方法针对动态的应用需求自动合成互操作共享语义，实现集成过程的高度自动化，更加符合当前实际应用的需要。本体作为一种有效的语义建模方法，在智能信息集成领域得到广泛的关注和应用。互操作信息源间缺失的语义和信息源应用系统中蕴涵的代码逻辑语义都需要一种声明式的、形式化的描述。为了在不同本体之间构建代数系统来解决异构信息集成中的语义冲突，提高信息集成的正确性、一致性和有效性，给出了本体的形式化定义和代数结构描述。鉴于XML已成为互连网上事实上的数据交换标准，给出了基于XML的本体表示和概念的检索方法。在给出了本体的分类及全局本体、用户本体、局部本体的定义后，根据一个具体的实例给出了局部本体到全局本体的映射规则，该方法具有一定的普适性，转换规则通常以一种机器可识别和处理的方式表达，可以提高集成工作的智能化、自动化程度。通过模拟人类记忆和联想的生理特点，给出了全局本体的语义关联模型；基于全局本体的语义关联模型，利用语义知识单元间纵横向语义关联实现了语义知识单元的查询，并给出了具体的算法，由此可以检索互操作局部本体间的相关本体，依据相关本体、局部本体和全局本体给出了用户本体的合成算法，用于用户本体的生成。查询处理是根据查询计划进行调度，并通过查询处理操作完成中间结果组装的过程，查询处理操作主要由全局查询涉及的所有场地间运算来完成。通过分析本体映射语义给出了查询处理的类型和转换规则，提出了一种连接树结构来表达集成系统的查询处理操作，并对其进行规范化处理。通过引入查询图的概念，将连接规范树转换为等价的查询图，供后查询处理调度使用，并给出了基于查询图的查询处理多级并发调度算法，以尽可能提高查询处理执行的并发性。在分析了影响查询处理优化的代价参数后，给出了局部数据源代价和通信代价的估计方法。笛卡儿积往往是查询处理中开销最大的运算，会产生大量的无效元组，应尽量避免，以场地间连接和外连接运算组成的查询图为基础，给出了一种基于线性序列的静态优化算法LOS和一种基于统计推理的动态优化方法SRD，并通过实验仿真的方法对它们的优化性能进行了实验分析和性能比较，实验结果验证了它们的有效性。
英文文摘	With the development of computer techniques and the wide application of database systems, an emerging demand for interoperating and programming across heterogeneous data sources has occurred. In the 90s last century structured information integration architectures were offered by some scholars, e.g. multi-database system. However the amount of information available on-line is proliferating at a tremendous rate during the last decade. On the other hand the application demands vary frequently. Structured information integration architecture shows some localization in the face of expanding information and frequently varying application demand. Semantic of information isn’t processed by machine in the architecture; the integration process is complicated and manual. It can’t meet new generation application. In the integration process of tradition integration method, developer must manually build mapping between some schemas from local information source space to integration information space. Shared semantic of interoperation isn’t formalized yet. Some semantic is still denoted in code logic. In the future the integrated system maybe becomes a member system of a larger integration system. Semantic of this code logic will still become obstacle of integration. Focusing on the problems we design a heterogeneous information integration architecture (HISIM) based on semantic of ontology mapping. Shared semantic of interoperation is formalized in HISIM. Semantic being processed by machine is realized; on the other hand facing the dynamic application requirement HISIM automatically construct shared semantic of interoperation, the process of integration is automatized. The new integration method meets the need of application more. As ontology is an effective method of semantic modeling, people pay more and more attention to it in intelligent information integration domain. The absent semantic in interoperation and the semantic in application code of information source need to be formalized and declared. A formal definition of ontology and ontology algebra are offered to solve semantic conflict between information sources. These formalizations improve correctness, consistency and validity of integration. Considering XML has actually become standard of data exchange on the Internet, we give an ontology denotation and conception search method based on XML. After giving ontology taxonomy and definition of domain ontology (DO), user ontology (UO), local ontology (LO) etc., we introduce how to build mapping rules between DO and LO. These rules are denoted in some manner so as that can be processed by machine. This method improves intelligence and automatization of integration process. On the basis above we give a semantic associated model of domain ontology, which simulates the physiological feature of human being’s memory and association. According to semantic association in length and breadth we realize semantic knowledge unit query and provide the query algorithm. Using the algorithm we can find association ontology (AO, which is semantic bridge between some LOs) from DO. Then we provide an algorithm that merges AO, GO and LO to UO, which is necessary for interoperation. Query processing is the process of scheduling the query execution plan and combining the intermediate results according to the query processing operations. These operations are composed of the inter-site operations related to the integration queries. By analyzing semantic of ontology mapping we can obtain the transformation rules and query processing operations, and we provide a join tree structure to denote the query processing operations in integration system. And the method for formalizing the join tree into join normal tree (JNT) is also presented. The concept of query join graph is introduced to the query processing scheduling, and then the JNT can be equivalently transferred to query join graph for the scheduling of query processing. Thus, a multi-level parallel scheduling algorithm for query processing based on query join graph is presented to improve the performance of concurrence of the query execution. Analyzing the cost parameters of query processing, we present the methods to estimate the costs of local data sources and inter-site communications. Generally Cartesian Product, which can generate many more invalid tuple, has the most costs in query processing. Query processing should avoid it. Based on the query join graph composed of inter-site joins and outer joins we offer a static optimization (LOS) algorithm based on a linear-order, and a statistical-reasoning-based dynamic optimization (SRD) method. We analyses performance through simulations and experimental. Results show that two optimization strategies are effective.