Intelligent and Distributed Computing Laboratory
  Study of Service-oriented Data Mining Key Techniques
Name Li Yuhua (李玉华)
Thesis defense date 2006.11.08
Thesis submission date 2006.11.21
Thesis level Doctoral
Chinese title 面向服务的数据挖掘关键技术研究
English title Study of Service-oriented Data Mining Key Techniques
Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords 数据挖掘;面向服务的体系结构;本体;语义数据集成;隐私保护;质量评价;服务选择
English keywords Data mining;Service Oriented Architecture;Ontology;Semantic data integration;Privacy protection;Quality evaluation;Service selection
Chinese abstract 解决大规模分布异构数据挖掘问题,需要一种便于资源集成、提供高质量数据挖掘服务,并具有较高安全性和隐私保护性的框架模型。面向服务的体系结构(Service Oriented Architecture,SOA)、本体、WEB服务等新技术将为数据挖掘系统的开发提供更强大的技术支持。根据分布异构环境下数据挖掘的特点,综合利用SOA、本体、WEB服务等新技术,遵循以用户为中心的理念,提出了一种开放式的面向服务的数据挖掘系统框架——SODMA。该框架将数据挖掘的算法封装成WEB服务,利用异构数据集成本体和隐私保护策略本体实现应用域的隐私保护语义数据集成,利用以用户为中心的数据挖掘本体和数据挖掘服务质量评价本体帮助用户动态选择适用且高质量的数据挖掘服务(DMS),可在分布环境下为不同领域多层次用户提供高可用性、高性能、高质量、安全的数据挖掘服务。 对于分布异构的数据挖掘,异构数据集成是数据预处理很关键的第一步。数据集成的模型需要有效的解决数据异构性、完整性、权限控制、集成范围限定等问题。在分析现有的数据仓库方式、中间件集成、基于本体的数据集成方法优缺点的基础上,借鉴已有的语义数据集成和隐私保护数据挖掘的研究成果,提出了一个基于智能体和本体的隐私保护语义数据集成模型,以解决应用域隐私保护数据挖掘数据预处理的问题。定义了隐私保护数据挖掘本体,数据集成采用全局视图(GAV)和局部视图(LAV)相结合的混合本体集成方法,隐私保护策略集成采用单本体的方法,同时利用模式模糊化和角色模糊化,以提高模型的隐私保护性。 数据挖掘服务是涉及数据、计算、挖掘知识的复杂服务应用,用户需要具备非常全面的专业知识才能正确使用。现有的以系统为中心的设计中,数据挖掘解决方案特别重视算法和系统工程,而没有首先探讨最终用户将如何方便地使用新的数据挖掘技术,使系统难于操作和使用。有的系统利用数据挖掘本体和预测执行时间的方法来帮助用户选择正确并且高质量的数据挖掘服务,但是数据挖掘本体只是对数据挖掘的方法进行枚举,无法保证服务的质量。在分析和总结了前人对于数据挖掘技术和系统研究成果的基础上,结合数据挖掘应用的领域知识,遵循以用户为中心的设计理念,提出了以用户为中心的数据挖掘本体,一方面根据数据挖掘功能和挖掘对象来组织数据挖掘算法,另一方面根据应用领域知识为用户提供有效的数据挖掘应用解决方案,帮助不同领域多层次用户方便选择数据挖掘服务。此外还进一步讨论了基于本体描述语言(OWL)的数据挖掘本体实现。 具体的数据挖掘算法和域应用解决方案是数据挖掘本体的实例,是用户应用的核心。研究了反洗钱领域数据挖掘应用解决方案实例,包括应用域的层次,若干基本的数据挖掘算法,数据挖掘应用解决方案和所用算法间的映射等。主要实现可疑交易甄别、交易网络分析和洗钱模式发现等数据挖掘应用,给出了可视化链接分析方法,可实现交互式可视化的交易网络分析,提出了基于图熵的链接发现算法,可有效地发现交易网络的关键节点,给出改进的基于Apriori的SLAGM频繁子图发现算法,用于交易网络的结构分析。 用户可利用域数据集成本体提供的语义模型,在数据挖掘本体的指导下选择数据挖掘算法和应用解决方案以定义具体的数据挖掘任务。在用户需求获取完成以后,接下来就是要根据用户需求选择合适的数据挖掘服务执行,而大多数最终用户并不具备这样的专业知识。从方便用户的角度出发,系统需提供一套服务选择机制,来帮助用户选择高质量的数据挖掘服务。系统综合通用WEB服务的评价标准、数据挖掘领域的专用评价因子及用户评价反馈等多种因素及服务的动态性,给出了一个较全面的数据挖掘服务评价本体,讨论了服务质量的评价方法,给出了基于服务质量评价的动态数据挖掘服务选择方法,用户可根据数据挖掘服务评价本体的语义模型,输入质量约束条件,也可以调整评价因子权值,系统在满足用户约束条件的服务集中,通过计算出服务的综合质量值,挑选最适合的算法执行。 基于上述成果,实现了一个外汇反洗钱领域的隐私保护数据集成和数据挖掘服务选择的原型系统,并总结了系统设计特点。
English abstract Solving large-scale, distributed, heterogeneous data mining (DM) problems requires a framework that facilitates resource integration and offers high-quality data mining services (DMS) with strong security and privacy protection. Service Oriented Architecture (SOA), ontologies, and Web services provide stronger technical support for developing DM systems. Based on the characteristics of DM in distributed heterogeneous environments and following a user-centered philosophy, this thesis proposes an open Service-Oriented Data Mining Architecture (SODMA) that combines SOA, ontologies, and Web services. SODMA wraps DM algorithms as Web services. It uses a heterogeneous data integration ontology and a privacy protection policy ontology to achieve privacy-preserving semantic data integration in the application domain, and it uses a user-centered DM ontology and a DMS quality evaluation ontology to help users dynamically select appropriate, high-quality DMS. It can thus offer highly available, high-performance, high-quality, and secure DMS to multi-level users from different domains in a distributed environment. Heterogeneous data integration is the critical first step of data preprocessing for distributed heterogeneous DM. A data integration model must effectively address data heterogeneity, integrity, access control, and restriction of the integration scope. After analyzing the strengths and weaknesses of the data warehouse, middleware integration, and ontology-based integration approaches, and drawing on existing work in semantic data integration and privacy-preserving DM, the thesis proposes a privacy-preserving semantic data integration model based on agents and ontologies, which addresses data preprocessing for privacy-preserving DM in the application domain. The model defines a privacy protection policy ontology. Data integration uses a hybrid ontology approach that combines global-as-view (GAV) and local-as-view (LAV), privacy protection policy integration uses a single-ontology approach, and schema obfuscation and role obfuscation further strengthen privacy protection.
DMS are complex, intensive applications involving data, computation, and mined knowledge, and using them correctly requires extensive professional knowledge. Existing "system-centered" DM solutions focus heavily on algorithms and systems engineering without first exploring how end users will employ the new DM technology, which makes the resulting systems hard to operate and use. Some systems use a DM ontology and execution-time prediction to help users select suitable, high-quality DMS, but such a DM ontology only enumerates DM methods and cannot guarantee quality of service (QoS). Building on prior results in DM techniques and systems, combining domain knowledge of DM applications, and following a user-centered design philosophy, the thesis presents a user-centered data mining ontology. It organizes DM algorithms by mining function and by the type of data mined, and it provides DM application solutions based on application-domain knowledge, helping multi-level users in different domains select DMS easily. An implementation of the ontology in the Web Ontology Language (OWL) is also discussed. Concrete DM algorithms and domain application solutions are instances of the user-centered DM ontology and form the core of the user's application. An example application solution for the anti-money-laundering domain is studied, including the hierarchy of the application domain, several basic DM algorithms, and the mapping between application solutions and the algorithms they use. It covers three main applications: identification of suspicious transactions, transaction network analysis, and discovery of money laundering patterns. A visual link analysis method is given that supports interactive, visual analysis of transaction networks. A link discovery algorithm based on graph entropy is proposed to identify the critical nodes of money laundering transaction networks.
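The abstract names the graph-entropy link discovery algorithm but does not specify it. The sketch below is therefore only an illustration of one common reading of the idea, not the thesis's method: it assumes the entropy is the Shannon entropy of the normalized degree distribution, and it scores each node by how much that entropy drops when the node and its edges are removed. The function names `degree_entropy` and `rank_key_nodes` and the toy transaction edges are hypothetical.

```python
import math
from collections import defaultdict


def degree_entropy(edges):
    """Shannon entropy (in bits) of p_i = deg(i) / sum of all degrees.

    Assumption: this degree-based definition is chosen for illustration;
    the thesis may define graph entropy differently.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total = sum(degree.values())
    if total == 0:
        return 0.0  # a graph with no edges carries no degree information
    return -sum((d / total) * math.log2(d / total) for d in degree.values())


def rank_key_nodes(edges):
    """Rank nodes by the entropy drop caused by removing each node."""
    base = degree_entropy(edges)
    nodes = {n for edge in edges for n in edge}
    scores = {}
    for n in nodes:
        remaining = [e for e in edges if n not in e]
        scores[n] = base - degree_entropy(remaining)
    # Highest entropy drop first; ties broken by node name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical transaction network: account "H" trades with everyone.
transactions = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"), ("a", "b")]
for account, score in rank_key_nodes(transactions):
    print(account, round(score, 3))
```

Under this scoring, an account whose removal flattens the network's structure the most ranks first, which matches the intuition of a "critical node" in a transaction network.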
An improved Apriori-based SLAGM frequent subgraph discovery algorithm is presented to mine frequent subgraphs from simple graphs efficiently; it can be used for structural analysis of transaction networks and for mining new money laundering patterns. Users can define a concrete DM task by using the semantic model of the domain data integration ontology and selecting algorithms and application solutions under the guidance of the DM ontology. Once the user requirements are captured, the next step is to select a suitable DMS to execute, but most end users lack the necessary expertise. For usability, the system therefore needs a service selection mechanism that helps users choose high-quality DMS. A more comprehensive DMS quality evaluation ontology (OntDMQ) is proposed by combining general Web service QoS criteria, DM-specific evaluation factors, subjective factors such as user feedback, and the dynamic nature of services. Methods for evaluating QoS are discussed, and a QoS-based dynamic service selection method is presented: users specify QoS constraints with reference to the evaluation ontology and may adjust the weights of the evaluation factors; the system then computes a composite quality value for each service that satisfies the constraints and selects the most suitable one for execution. Based on these results, a prototype system for privacy-preserving data integration and DMS selection in the foreign exchange anti-money-laundering domain has been implemented, and its design characteristics are summarized.
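The QoS-based selection step described above (filter by user constraints, then pick the service with the best composite quality value) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: quality factors are assumed to be normalized to [0, 1] with higher meaning better, constraints are assumed to be minimum thresholds, and the composite value is assumed to be a weighted average. The function `select_service`, the factor names, and the service names are all hypothetical.

```python
def select_service(services, constraints, weights):
    """Return the name of the best feasible service, or None if none qualify.

    services:    {service name: {factor: score in [0, 1], higher is better}}
    constraints: {factor: minimum acceptable score}   (assumed form)
    weights:     {factor: non-negative weight}        (user-adjustable)
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    # Step 1: keep only services that satisfy every user constraint.
    feasible = {
        name: quality
        for name, quality in services.items()
        if all(quality.get(f, 0.0) >= minimum for f, minimum in constraints.items())
    }
    if not feasible:
        return None

    # Step 2: composite quality value as a weighted average (an assumption).
    def composite(quality):
        return sum(w * quality.get(f, 0.0) for f, w in weights.items()) / total_weight

    # Step 3: choose the feasible service with the highest composite value.
    return max(feasible, key=lambda name: composite(feasible[name]))


# Hypothetical candidate services and user preferences.
candidates = {
    "svc_a": {"accuracy": 0.90, "availability": 0.60, "user_rating": 0.70},
    "svc_b": {"accuracy": 0.80, "availability": 0.95, "user_rating": 0.80},
    "svc_c": {"accuracy": 0.95, "availability": 0.40, "user_rating": 0.90},
}
user_constraints = {"availability": 0.5}
user_weights = {"accuracy": 0.5, "availability": 0.3, "user_rating": 0.2}
print(select_service(candidates, user_constraints, user_weights))
```

Changing the weights changes the outcome: a user who cares only about accuracy would get a different service from the same feasible set, which is the effect the adjustable evaluation-factor weights are meant to have.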