Intelligent and Distributed Computing Laboratory
  Key Technologies of Semistructured Data Integration in Multidatabase
Name Deng Xi (邓曦)
Thesis defense date 2003.05.09
Thesis submission date 2005.10.18
Degree level Master's
Chinese title 集成半结构化数据的多数据库系统关键技术研究
English title Key Technologies of Semistructured Data Integration in Multidatabase
Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords 多数据库系统;半结构化数据;公共数据模型;查询优化;数据抽取
English keywords multidatabase system;semistructured data;common data model;query optimization;data extraction
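Chinese abstract (translated) Multidatabase systems offer an effective way to integrate multiple distributed, heterogeneous, and autonomous databases. With the rapid growth of networks, however, the objects a multidatabase system must manage are no longer limited to the structured data held in traditional databases, and the need to manage semistructured data poses new challenges for multidatabase technology. How to integrate structured and semistructured data effectively and present them to users through a unified view has become a pressing research problem. Building on an analysis and synthesis of distributed object technology, multidatabase technology, and XML technology, the thesis presents a method for integrating distributed, heterogeneous data sources. The method takes the multidatabase system as its technical foundation, uses CORBA middleware as its physical framework, adopts an XML-oriented common data model to represent structured and semistructured data, and hides their semantic and functional differences behind local agent wrappers. XML, as a new-generation Web data exchange language, has strong expressive power and broad application prospects. By extending the OIM object model to suit the characteristics of XML, the thesis gives XOIM, an XML-oriented common data model for extended multidatabase systems, together with its object algebra. The model is an ordered directed graph, which makes it well suited to representing structured and semistructured data. Integrating semistructured data also places new demands on multidatabase query optimization. Because the query capabilities of the data sources differ widely, the thesis studies algebraic optimization based on data source query capabilities and gives a dynamic optimization algorithm, based on a multiple linear regression model, for scheduling and optimizing query post-processing. Based on these methods, the author, as a principal team member, designed and implemented the Panorama extended multidatabase prototype system. The system implements the basic functions of a local agent for semistructured data sources: it can extract data from HTML Web pages and generate a local schema, and it provides transparent interoperation across semistructured data sources and database systems such as Oracle, Sybase, and DB2.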
English abstract Multidatabase systems offer a good solution for integrating multiple distributed, heterogeneous, and autonomous databases. With the rapid development of networks, however, a multidatabase system has to manage not only the structured data in traditional databases but semistructured data as well. Integrating semistructured data is a new challenge for multidatabase technology. How to obtain and integrate structured and semistructured data and present an integrated view to users has become an important issue in this research area. After analyzing research achievements in distributed object computing, multidatabase technology, and XML, we put forward an approach to integrating data from distributed, heterogeneous data sources. Founded on multidatabase technology, this approach uses CORBA as its physical infrastructure and an XML-oriented common data model to represent structured and semistructured data, and it resolves the semantic and functional differences between them through local agents. As a new Web data exchange language, XML has good expressive power and promising applications. By extending the OIM data model according to the characteristics of XML, we give an XML-oriented common data model, XOIM, and its object algebra; the model has an ordered graph structure and is therefore well suited to representing structured and semistructured data. Integrating semistructured data also requires extending the multidatabase query optimization architecture. We discuss query optimization based on the query capability of each data source, and a dynamic optimization algorithm, based on a multiple regression cost model, for scheduling and optimizing query post-processing. Based on these methods, we designed and implemented an extended multidatabase system prototype named Panorama.
We describe in detail the implementation of the local agent for semistructured data sources in Panorama, which can extract data from HTML Web pages and generate a local schema, and thereby provides transparent interoperability among semistructured data sources and existing database management systems such as Oracle8, Sybase, and DB2.
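
The abstract does not say how the local agent extracts data or derives a schema, so the following is only an illustrative sketch of the general idea, not Panorama's implementation. It assumes the page holds its data in a simple HTML table whose first row is a header; the names `TableExtractor`, `infer_type`, and `extract`, and the three type labels, are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> in an HTML page, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def infer_type(values):
    """Guess a column type from its string values (assumed labels)."""
    for name, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "TEXT"


def extract(html):
    """Return (schema, records), treating the first row as the header."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return {}, []
    header, *body = parser.rows
    columns = list(zip(*body))  # transpose rows into columns
    schema = {name: infer_type(col) for name, col in zip(header, columns)}
    records = [dict(zip(header, row)) for row in body]
    return schema, records
```

Calling `extract` on a page yields a column-to-type mapping that plays the role of a local schema, plus the extracted rows as dictionaries. A real wrapper would also have to cope with nested tables, missing cells, and pages that do not use tables at all.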