Intelligent and Distributed Computing Laboratory
  Web Data Integration from Multiple Data Sources and Query Research
Name Zhang Suzhi (张素智)
Thesis defense date 2003.10.16
Thesis submission date 2003.10.16
Thesis level Doctoral
Chinese title 基于Web的多数据源数据集成与查询研究
English title Web Data Integration from Multiple Data Sources and Query Research
Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords 数据集成;多数据库系统;数据抽取;嵌套关系代数;查询处理
English keywords data integration; multidatabase system; data extraction; nested relational algebra; query processing
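Abstract (translated from Chinese) With the rapid development of network technology, especially the Internet, and the growing demands of enterprise applications, more and more users want to access and process data from multiple data sources at the same time. The rapid growth of the WWW has provided a convenient means of delivering and sharing information worldwide, and the Web has become an increasingly important resource with great potential. The Web contains a large amount of heterogeneous information and services, and has become the de facto supporting environment in fields such as finance, commerce, education, and health care. Web data is comprehensive, timely, and diverse, but it is also often highly autonomous, and much semi-structured data cannot be handled the way structured data is handled in traditional databases. The main goal of a Web-based multi-source data integration system is to make different Web data sources work together and to give global users a friendly query interface through which they can conveniently access the information they need. Web-based multi-source data integration is a key technology for modern enterprises building B2BI applications, and it will have a far-reaching effect on enterprise development.
Building on research results in data integration from China and abroad, and through an analysis of distributed object technology, agent technology, and XML technology, the dissertation proposes a Web data integration framework based on XML and CORBA that takes the multidatabase system as its main technical approach. The framework uses CORBA as the object model and XML as the data model, and uses an XML query language to integrate data from heterogeneous data sources on the Web; that is, it treats the Web as one huge database and applies database methods to integrate and manage Web data. Because CORBA handles platform-level heterogeneity and implementation transparency, and XML serves as the common data model that gives data a uniform representation, the framework offers good flexibility and clear advantages.
Web data integration technology covers the integration of CORBA with the Web, the common data model (CDM) and metadata dictionary (MDD), schema integration and transformation, and the design of the global integrated view. The dissertation studies current Web technologies, Web object technologies, and ways of integrating CORBA with the Web, and on that basis proposes ways of connecting databases in a CORBA and Web environment and a method for building CORBA applications on the Web with Java. It defines the XML data model and gives an algorithm for converting a local data model to the XML data model. The MDD model is designed with XML Schema, and the metadata of the data sources is managed through an MDD management program. The dissertation analyzes the concept of schema integration, its goals, and the problems it must solve, and proposes resolving semantic and structural heterogeneity in schema integration by designing Wrapper/Agent components in the Web data integration system.
For extracting data from Web page sources, the dissertation proposes schema-guided automatic data extraction, the Awdgs system. The system uses a four-stage extraction strategy; by defining an extraction schema, the generated wrapper program can extract the required data more precisely. To fit the global schema and query requirements of the whole integration system, the wrapper outputs a valid XML document that conforms to the extraction schema. An inductive learning algorithm is used; because table and list data in the integrated Web pages follow regular patterns, the algorithm is effective.
Query processing in a multi-source data integration system is divided into three parts: (1) the fundamentals of query processing; (2) XML query algebra and query languages; and (3) conversion between XML and databases, including storage conversion between XML data (databases) and relational databases and conversion between XML queries and SQL queries. The dissertation introduces nested relational algebra (NRA) and describes the semantics of XML queries with NRA and the operators of extended regular expressions. It clarifies the concept of an XML database and focuses on XML data storage, query decomposition, and query translation. It describes a method for decomposing global queries defined in XQuery, proposes two steps for translating XML queries to SQL, and analyzes the key techniques and algorithms involved.
The dissertation studies Web services technology and analyzes the considerations for adopting Web services in an integration system. Web services provide a new framework for enterprise system integration and have clear advantages over traditional distributed computing technologies (such as COM/DCOM, CORBA, and EJB) in simplicity, openness, flexibility, dynamism, and efficiency, which points the way for further research; even so, an enterprise should weigh carefully whether to adopt Web services.
The dissertation describes the functions and implementation of Panorama, an extended multidatabase prototype system developed in-house. The system uses a CORBA-based layered architecture in which a global agent and local agents cooperate to carry out concrete business processes. It uses the XML data model to define, store, and manage the global schema, provides schema integration, query processing, and transaction processing, and offers transparent interoperation with member databases such as Oracle, Sybase, and DM2.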
English abstract With growing application requirements and the development of network technology such as the Internet, more and more users expect to access and manipulate data from multiple data sources. The WWW is becoming an important resource with great potential for delivering and sharing information worldwide. The resources on the Web include not only conventional databases, such as relational and object-oriented databases, which have well-formed data models, but also unstructured and semi-structured data and Web services. Because Web data is irregular and takes many forms, it is difficult to store and manage all of it with conventional database technology. The purpose of a Web-based integration system over multiple data sources is to make diverse Web data sources cooperate and to provide a friendly query interface that lets global users access data from those sources easily and conveniently. Integrating data from distributed, heterogeneous, multiple data sources on the Web into a usable whole is a challenging and urgent task for many applications and enterprises, and it is also a key technology for building business-to-business integration (B2BI).
Based on a survey of important data integration research and an analysis of developments in several main distributed object technologies, agent technology, and XML, this research presents an architecture for Web data integration with XML and CORBA. In this architecture, we view the Web as a huge virtual database, take CORBA as the object model and XML as the mediated data model, and use XQuery as the XML query language to accomplish data query and integration on the Web. We also elaborate and analyze some implementation methods, such as the integrator and the wrappers corresponding to various data sources. Compared with earlier data integration systems, the architecture is flexible and advantageous.
Web data integration covers the integration of CORBA with the Web, the common data model (CDM) and metadata management, schema integration, and global view design. We study Web technologies, distributed object technologies, and methods for integrating CORBA with the Web, and present a way to connect distributed databases in the system and the steps for creating CORBA applications in Java. After defining the XML data model, we present an algorithm for converting between local data models and the XML data model, and a metadata dictionary (MDD) model used to manage metadata about multiple heterogeneous data sources. We study schema integration technologies and present a solution for syntactic and semantic heterogeneity.
With the development of the Internet, the Web has become an invaluable information source. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML. We propose Awdgs, a schema-guided approach to automatic Web data extraction, which automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction schema in the form of a DTD, the wrapper is generated automatically by an inductive learning algorithm. Experiments indicate that the approach correctly extracts the required data from the source document with high accuracy.
The dissertation concentrates on query processing within the Web data integration system, which comprises three parts: (1) the basics of distributed query processing; (2) XML query algebra and XML query languages; and (3) transformation between XML and databases, including data storage transformation and the transformation from XML queries to SQL. We introduce nested relational algebra (NRA), which is used to describe XML query semantics. We also introduce the concept of an XML database. Finally, we describe the decomposition of global queries defined in XQuery, and the key technologies and algorithms for transforming XML queries to SQL.
We study Web services, a new kind of architecture for system integration. A Web service has distinctive behavioral characteristics: it is XML-based, loosely coupled, coarse-grained, and so on. We analyze the tradeoffs of adopting Web services in an integration system. Finally, we introduce Panorama, an extended MDBS prototype designed and implemented by the system integration group at HUST. Panorama performs schema integration, query processing, and transaction processing, and therefore provides transparent interoperability among existing database management systems such as Oracle8, Sybase, and DM2.