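Research on Incremental Clustering Algorithms in Financial Data Mining and Their Applications - Intelligent and Distributed Computing Laboratory (智能与分布计算实验室)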

Research on Incremental Clustering Algorithms in Financial Data Mining and Their Applications

Name	Sun Xiaolin (孙小林)
Thesis defense date	2004.05.10
Thesis submission date	2004.05.11
Thesis level	Master's
Chinese title	金融数据挖掘中的增量聚类算法及应用研究
English title	The Increment Clustering Algorithm in Financial Data Mining and Its Application and Research
Advisor 1	Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords	增量聚类;概念聚类;孤立点;金融数据挖掘;洗钱
English keywords	increment clustering;conceptual clustering;outlier;financial data mining;money laundering
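Chinese abstract (translated)	Traditional cluster analysis methods generally do not address large-volume data sets, yet one research focus of data mining in finance is how to obtain knowledge efficiently from massive data. Moreover, research on traditional clustering methods has concentrated on numeric attributes, whereas wire-transfer data contain many non-numeric attributes and data sets with diverse characteristics. The difficulty of understanding clustering output is another problem of traditional methods. Research on clustering algorithms for anti-money-laundering systems therefore concentrates on three questions: how to improve clustering efficiency on large data sets, how to handle data sets with various characteristics (such as document data and categorical data), and how to give conceptual explanations of clustering results. The decision support system of the State Administration of Foreign Exchange has begun studying the application of data mining technology in its off-site supervision system. Efficiently partitioning large-scale data sets into meaningful subsets is one of the basic problems of financial data mining. Because data collection is arbitrary and irregular, and because markets develop gradually while management regulations lag behind, financial data mining must work with little background knowledge on data that have complex attribute types, contain noise and outliers, and are incomplete. The traditional BIRCH algorithm suits large databases because it is incremental, but its summary-information approach cannot handle categorical attributes; the K-means algorithm can handle categorical attributes, but its high cost makes it unsuitable for large databases. Combining the partition-based K-means center-point algorithm with the hierarchy-based incremental BIRCH algorithm, the author proposes the Core-Tree idea to offset the shortcomings of both: center points represent the summary information in BIRCH, and class cores make determining the center points more efficient. At the same time, a conceptual-model-based method is applied to the clustering output so that the output can be interpreted as an understandable hierarchy, which improves the output quality of the algorithm. Finally, the author proposes a scheme for applying the Core-Tree algorithm to the national foreign exchange information management and decision system, and shows through experiments that applying the algorithm to financial data mining achieves the expected results.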
English abstract	Most traditional clustering methods used in data mining were not designed for very large data sets. Yet efficiently obtaining knowledge from massive data is one of the central problems in financial data mining. In addition, traditional clustering analysis has focused mainly on numeric data rather than the other types of data that exist in the financial field. The output of traditional clustering methods is also difficult to understand. Clustering analysis for detecting money laundering therefore aims to improve the efficiency of the algorithm, to handle various types of data (such as document and categorical data), and to give a conceptual explanation of the clustering results. SAFE-MIDSS (State Administration of Foreign Exchange - Management Information & Decision Support System) has begun to study the application of data mining technology in its money-laundering detection system. A fundamental issue in financial data mining is dividing large data sets into meaningful subsets efficiently. Because data collection is irregular, the market develops gradually, and management regulations lag behind, financial data mining must deal with incomplete data that have complex attributes, noise, and outliers, with little background knowledge available. The traditional BIRCH algorithm suits large data sets because it is incremental, but its summary-information approach cannot deal with categorical data. The K-means algorithm can deal with categorical data, but its high computational cost makes it difficult to apply to large data sets.
Building on the K-means center-point algorithm and the incremental BIRCH algorithm, the author proposes the concept of the Core-Tree, which compensates for the weaknesses of both algorithms: center points represent the summary information in BIRCH, and class cores make determining the center points more efficient. In addition, a method based on a conceptual model is applied to the clustering output so that the results are easier to understand, which improves output quality. Finally, the author proposes a scheme for applying the Core-Tree algorithm to SAFE-MIDSS and demonstrates that the algorithm achieves its intended purpose.
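The abstract gives no pseudocode, so the following Python sketch is only one possible reading of the idea of summarizing clusters by center points and class cores, not the thesis's Core-Tree algorithm. It assumes a flat list of clusters instead of a tree, a simple matching distance for categorical attributes, and hypothetical names and parameters (`mismatch`, `Cluster`, `incremental_cluster`, `threshold`, `core_size`). Each record is read once and either joins the nearest cluster or starts a new one, and the center point is re-selected from a small core of records rather than from every member.

```python
from dataclasses import dataclass, field


def mismatch(a, b):
    """Simple matching distance: fraction of attributes that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


@dataclass
class Cluster:
    medoid: tuple                              # center point summarizing the cluster
    core: list = field(default_factory=list)   # small set of records near the medoid
    size: int = 0                              # number of records absorbed so far

    def add(self, record, core_size):
        """Absorb one record and refresh the medoid using only the core."""
        self.size += 1
        self.core.append(record)
        # Assumption: the medoid is the core record with the smallest
        # total distance to the other core records.
        self.medoid = min(
            self.core,
            key=lambda c: sum(mismatch(c, other) for other in self.core),
        )
        # Keep only the core_size records closest to the new medoid.
        self.core.sort(key=lambda r: mismatch(r, self.medoid))
        del self.core[core_size:]


def incremental_cluster(records, threshold=0.4, core_size=5):
    """Single pass over the data: join the nearest cluster or open a new one."""
    clusters = []
    for rec in records:
        nearest = min(clusters, key=lambda c: mismatch(rec, c.medoid), default=None)
        if nearest is not None and mismatch(rec, nearest.medoid) <= threshold:
            nearest.add(rec, core_size)
        else:
            clusters.append(Cluster(medoid=rec, core=[rec], size=1))
    return clusters


if __name__ == "__main__":
    # Made-up categorical records: (currency, channel, country).
    records = [
        ("USD", "wire", "HK"),
        ("USD", "wire", "SG"),
        ("EUR", "cash", "DE"),
        ("EUR", "cash", "FR"),
        ("USD", "wire", "HK"),
    ]
    for c in incremental_cluster(records):
        print(c.medoid, c.size)
```

Because only the bounded core is examined when a record arrives, the cost of updating a center point does not grow with cluster size, which is the kind of saving the abstract attributes to class cores. A tree of such summaries, as in BIRCH, would additionally limit how many clusters each new record is compared against.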