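Study of Data Mining Algorithms Based on Rough set and Clustering and Application in Anti-Money Laundering - Intelligent and Distributed Computing Laboratory (智能与分布计算实验室)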

Study of Data Mining Algorithms Based on Rough set and Clustering and Application in Anti-Money Laundering

Name	Chen Yunkai (陈云开)
Thesis defense date	2007.06.04
Thesis submission date	2007.06.08
Thesis level	Doctoral
Chinese title	基于粗糙集和聚类的数据挖掘算法及其在反洗钱中的应用研究
English title	Study of Data Mining Algorithms Based on Rough set and Clustering and Application in Anti-Money Laundering
Advisor 1	Lu Zhengding (卢正鼎)
Advisor 2	
Chinese keywords	数据挖掘;粗糙集;超图模型;高维数据聚类;增量概念聚类;可疑交易判定;反洗钱
English keywords	Data mining;Rough set;Hypergraph model;High dimensional data clustering;Incremental conceptual clustering;Suspicious trade identifying;Anti-money laundering
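Chinese abstract (translated)	The task of data mining is to discover previously unknown knowledge in large volumes of data, especially the relationships and trends hidden in them. This capability is exactly what an anti-money-laundering monitoring and analysis system requires. Many data mining algorithms have good application prospects in the anti-money-laundering field, and using data mining for anti-money-laundering data monitoring and analysis is an active research topic in China and abroad. Studying the key data mining algorithms suited to the characteristics of financial transaction data and applying them to the construction of China's anti-money-laundering system therefore has significant theoretical and practical value. Rough set theory is a mathematical tool for handling vague and uncertain knowledge and has been applied successfully in artificial intelligence and knowledge discovery, pattern recognition and classification, and fault detection. A rough-set-based mining algorithm is presented that generates a decision model for judging whether a transaction is suspicious. The algorithm first performs attribute reduction, forming a discernibility matrix, and then discovers rules from it; it suits cases in which the dependency between decision attributes and classification attributes is unclear and the given data are incomplete. Because data are sparsely distributed in high-dimensional space, common clustering algorithms, which mostly measure similarity between data points with a distance metric before clustering, do not produce good clustering results. A hypergraph-based clustering algorithm is presented that converts the high-dimensional clustering problem into a hypergraph partitioning optimization problem: relationships in the high-dimensional space are transformed into a hypergraph, and hyperedge weights describe the relationships among points. Partitioning the hypergraph is in effect the clustering process: the data points contained in heavily weighted hyperedges are placed in the same cluster as far as possible, while the total weight of the hyperedges that are cut is minimized. Clustering can be completed without reducing the dimensionality of the data set beforehand; the method removes noise points effectively and obtains good clustering results in high-dimensional space. To address the difficulty traditional clustering algorithms have in explaining their results, clustering results are given conceptual descriptions through semantic centers, so that the semantic centers reflect the characteristics of the clusters as fully as possible. Because categorical data make up a large share of financial data, conceptual clustering adapts to such data better than traditional clustering based on numerical data. The incremental conceptual clustering algorithm based on explanation rules presented here can give the approximate meaning of the clustering results and can generate rules by computing assurance factors and subsumption factors between concepts and attributes, mining hidden information at a deeper level. On the basis of this research, and in light of China's specific anti-money-laundering situation and the results and experience of anti-money-laundering system construction in the United States, Canada, Australia, and other countries, an anti-money-laundering information system suited to China's national conditions is studied. After analyzing the construction background of the system and the existing informatization foundation, the construction goals of the system are determined and its overall framework is designed, comprising three parts: an information-assisted verification platform, a detection and analysis platform, and an anti-money-laundering data mining platform. Based on the above theory and research results, and combining data integration and exchange, data warehouse, and OLAP technology, an anti-money-laundering information system was developed and implemented. It has been successfully applied in the anti-money-laundering work of the State Administration of Foreign Exchange and deployed nationwide. The system is the first specialized, intelligent anti-money-laundering information management system developed in China; it implements and strengthens the analysis and processing of anti-money-laundering data, improves the efficiency and quality of anti-money-laundering work, and has achieved satisfactory results. The project won second prize in the People's Bank of China 2006 Banking Science and Technology Development Award.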
English abstract	The task of data mining is to discover previously unknown knowledge in large volumes of data, particularly the relationships and trends hidden in them. These characteristics and functions meet the needs of an anti-money-laundering monitoring and analysis system. Many data mining algorithms have good prospects in the anti-money-laundering field, and applying data mining to the monitoring and analysis of anti-money-laundering data is an active research topic. Studying the key algorithms that suit the characteristics of financial transaction data and applying them in an anti-money-laundering system is therefore of great theoretical and practical value. Rough set theory is a mathematical tool for dealing with vagueness and uncertainty and has found applications in many areas, such as AI, KDD, pattern recognition and classification, and fault diagnosis. A rough-set-based mining algorithm is proposed to generate a decision model for the anti-money-laundering system; the generated model is used to find suspicious transactions. The algorithm first reduces the attributes by constructing a discernibility matrix, and rules are then found from the training data set. The algorithm is helpful when the dependency between the decision and classification attributes is unclear and the data set is incomplete. Because conventional clustering algorithms measure the similarity between objects with a distance metric and do not give good clustering results for high-dimensional data, a new hypergraph-based clustering algorithm is proposed that formulates data clustering in a high-dimensional space as a hypergraph partitioning optimization problem. The relationships in the high-dimensional space are transformed into a hypergraph, and the relationship between points is described by the weights of the hyperedges. 
Partitioning the hypergraph is itself the clustering process: the points contained in hyperedges with larger weights are placed in the same cluster as far as possible, while the sum of the weights of the cut hyperedges is minimized. The method requires no prior dimensionality reduction, filters noise points out of the clusters effectively, and gives control over the quality of the clusters. To explain the semantics of the clusters that a clustering algorithm generates, semantic centers are proposed to describe the results with concepts, so that the characteristics of the clusters are better represented. Compared with traditional clustering algorithms, conceptual clustering adapts much better to categorical data. The proposed incremental conceptual clustering algorithm based on explanation rules can give the semantics of the clustering results and discover deeper hidden information by generating rules from computed assurance factors and subsumption factors. On the basis of this work, in light of China's specific anti-money-laundering situation and drawing on the results and experience of anti-money-laundering system construction in the United States, Canada, Australia, and other countries, an anti-money-laundering information system suited to China is studied. After analyzing the background of the system and the existing information infrastructure, the goals of the system are set and its overall framework is designed, comprising an information-assisted verification platform, an analysis platform, and an anti-money-laundering data mining platform. 
Based on the above theory and research results, and combining data integration and exchange, data warehouse, and OLAP technology, an anti-money-laundering information system was developed and implemented. It has been successfully applied in the anti-money-laundering work of the State Administration of Foreign Exchange and deployed across the country. The system is the first specialized, intelligent anti-money-laundering information management system developed in China; it implements and strengthens the analysis and processing of anti-money-laundering data, improves the efficiency and quality of anti-money-laundering work, and has achieved satisfactory results. The project won second prize in the People's Bank of China 2006 Banking Science and Technology Development Award.
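
Illustrative note: the abstract states that the rough-set algorithm reduces attributes by constructing a discernibility matrix before rules are extracted. The record gives no further detail, so the following is only a minimal sketch of the textbook discernibility matrix, not the thesis algorithm. The function names, the attribute names (amount, channel, cross_border, suspicious), and the four-row decision table are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations


def discernibility_matrix(rows, condition_attrs, decision_attr):
    """Map each pair of row indices whose decisions differ to the set of
    condition attributes on which the two rows take different values."""
    matrix = {}
    for i, j in combinations(range(len(rows)), 2):
        # Only pairs with different decisions need to be told apart.
        if rows[i][decision_attr] != rows[j][decision_attr]:
            matrix[(i, j)] = frozenset(
                a for a in condition_attrs if rows[i][a] != rows[j][a]
            )
    return matrix


def core_attributes(matrix):
    """Attributes that occur as singleton entries cannot be dropped,
    so they belong to every reduct (the core)."""
    return {next(iter(entry)) for entry in matrix.values() if len(entry) == 1}


# Hypothetical toy decision table (assumption, not data from the thesis).
transactions = [
    {"amount": "high", "channel": "cash", "cross_border": "yes", "suspicious": "yes"},
    {"amount": "high", "channel": "wire", "cross_border": "yes", "suspicious": "no"},
    {"amount": "low", "channel": "cash", "cross_border": "no", "suspicious": "no"},
    {"amount": "low", "channel": "wire", "cross_border": "no", "suspicious": "no"},
]
conditions = ["amount", "channel", "cross_border"]

matrix = discernibility_matrix(transactions, conditions, "suspicious")
core = core_attributes(matrix)

for pair, attrs in sorted(matrix.items()):
    print(pair, sorted(attrs))
print("core:", sorted(core))
```

In this toy table, rows 0 and 1 differ only in channel, so channel is the one attribute that no reduct can omit. A full reduct computation and the rule extraction step would build on this matrix; both are left out because the record does not describe them.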