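Research on Outlier Detection Algorithm and its Application in Anti-Money Laundering - Intelligent and Distributed Computing Laboratory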

Research on Outlier Detection Algorithm and its Application in Anti-Money Laundering

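Name	Wang Qiong (王琼)
Thesis defense date	2005.05.11
Thesis submission date	2005.05.13
Thesis level	Master's
Chinese title	离群数据挖掘算法及其在反洗钱中的应用研究
English title	Research on Outlier Detection Algorithm and its Application in Anti-Money Laundering
Advisor 1	Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords (translated)	Data mining;Outlier data;Intertransaction association rules;Outlier patterns;Anti-money laundering
English keywords	Data Mining;Outliers;Intertransaction Association Rule;Outliered Pattern;Anti-Money Laundering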
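Chinese abstract (translated)	Outlier mining is an important area of data mining, used mainly to discover small patterns in data, and it has significant practical value in many fields. Traditional outlier mining research has tried to find a unified definition and mining scheme, ignoring the application-dependent nature of the problem. Applying a component-based approach and incorporating the practical requirements of anti-money laundering, three new algorithms are proposed. The OSVMOD (One-class Support Vector Machine Outlier Detection) algorithm incorporates machine learning into outlier mining: it derives the outlier definition from the one-class support vector machine principle, constructs a data-dependent kernel function based on Riemannian geometry to optimize the decision function, and finally separates out the outlier records in a single scan of the database. The algorithm avoids the need for manually supplied prior knowledge and parameter settings found in traditional algorithms, making monitoring truly automatic; the core algorithm operates only on a small sample set, which greatly reduces processing time. The outlier rule mining algorithms break out of the traditional restriction to purely numeric features and apply the recently proposed concept of intertransaction association rules to the definition of outliers. The OARD (Outliered Association Rule Detection) algorithm, based on the principle of the traditional Apriori algorithm, mines outlier rules whose support is below a threshold, and optimizes storage with data structures such as a frequent-itemset linked list, data encoding, and a hash tree, giving good practical performance. The DBOARD (Density Based Outliered Association Rule Detection) algorithm defines an association density to describe the regularity with which intertransaction associations occur, and applies a deviation-based algorithm to the association density sequence to mine outliers. Previous research has paid little attention to the semantics of outliers, which makes mining results hard for users to relate to. The similarity-based outlier pattern mining model defines a knowledge set that describes the semantics of outliers and a similarity measure that describes the relationships between groups of outliers. It first applies an improved DBOD (Distance Based Outlier Detection) algorithm to mine local outliers, then computes the knowledge sets of the outliers found, and finally computes a similarity matrix to obtain the outlier patterns whose similarity exceeds a threshold. The algorithm has linear complexity, and its results are easy to understand. Building on this work, an anti-money laundering outlier mining system was designed and implemented within the prototype of the national foreign exchange information management and decision system; the system effectively monitors business data and raises early warnings.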
English abstract	Outlier detection is an important field of data mining that deals with small patterns, and it has significant applications in many fields. Traditional research has tried to find a general outlier definition and mining scheme while ignoring the application-dependent nature of the problem. Based on a component-oriented approach and driven by the practical needs of anti-money laundering, three new outlier detection solutions are proposed. The OSVMOD (One-class Support Vector Machine Outlier Detection) algorithm incorporates machine learning: it constructs a data-dependent kernel function based on Riemannian geometry to optimize the decision function, then detects the outliers in a single scan of the database. It requires neither prior knowledge nor manually set parameters, which makes auditing and management automatic. The core of OSVMOD works only on a small sample set, which reduces execution time. The outlier rule mining algorithms break out of the traditional restriction to numeric features, using the recently proposed concept of intertransaction association rules to define outliers. The OARD (Outliered Association Rule Detection) algorithm mines outlier association rules based on the traditional Apriori algorithm and optimizes storage using a frequent-itemset linked list, data coding, and a hash tree. DBOARD (Density Based Outliered Association Rule Detection) defines an association density to describe how frequently associations between transactions appear, and applies a deviation-based outlier detection algorithm to the density sequence to mine outliers. It works well in practice.
Previous research has seldom considered the meaning of outliers, so the results remain puzzling to users. The similarity-based outlier pattern detection model defines a knowledge set to characterize the semantics of outliers and a similarity measure to relate different groups of outliers. It uses the improved DBOD (Distance Based Outlier Detection) algorithm to mine local outliers, then computes knowledge sets and a similarity matrix to obtain the outlier patterns. The algorithm has linear time complexity, and its results are easy to understand. The three solutions were used as the core of an outlier mining system for anti-money laundering. Tested on real data, it showed good performance.
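
The abstract names distance-based outlier detection (DBOD) as the first step of the pattern model but does not spell out the thesis's improved variant. As a rough illustration only, the sketch below uses the classic DB(p, d) definition: a record is an outlier if at least a fraction p of the other records lie farther than distance d from it. The function name `db_outliers`, the parameters `p` and `d`, and the toy transaction data are assumptions made for this example and are not taken from the thesis. This naive version compares every pair of records, so it is quadratic and does not reproduce the linear complexity the abstract reports.

```python
from math import dist


def db_outliers(points, p, d):
    """Return indices of classic DB(p, d) outliers.

    A point is flagged when at least a fraction p of the OTHER points
    lie farther than distance d from it (illustrative definition, not
    the thesis's improved DBOD).
    """
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count the other points that are farther than d from point i.
        far = sum(1 for j, b in enumerate(points) if j != i and dist(a, b) > d)
        if n > 1 and far >= p * (n - 1):
            outliers.append(i)
    return outliers


# Hypothetical toy records: (amount in thousands, transfers per day).
transactions = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.2), (9.5, 14.0)]
flagged = db_outliers(transactions, p=0.9, d=3.0)
print(flagged)
```

With these toy values only the last record is far from all the others, so it is the one flagged. Choosing `p` and `d` is exactly the kind of manual parameter setting that the abstract says OSVMOD is meant to avoid.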