保持隐私的数据挖掘-智能与分布计算实验室

保持隐私的数据挖掘

姓名	黄毅群
论文答辩日期	2006.04.23
论文提交日期	2006.04.23
论文级别	博士
中文题名	保持隐私的数据挖掘
英文题名	Privacy Preserving Data Mining
导师1	卢正鼎
导师2	胡和平
中文关键词	数据挖掘;隐私;多方安全计算;数据干扰;点积;关联规则;分类树;异常检测
英文关键词	Data Mining;Privacy;Secure Multi-party Computation;Data Perturbation;Scalar Product;Association Rule;Decision Tree;Outlier Detection
中文文摘	具等多个方面，对现有的保持隐私的数据挖掘做了较全面的综述。在简要探讨保持隐私的数据挖掘的一般原理和典型技术之后，依据多方安全计算理论实现了保持隐私的数据挖掘。点积作为一种计算工具，被许多数据挖掘任务所采用。在分析现有的安全点积协议的基础上，依据完全的两方安全计算模型，提出了一种新颖的PPSP安全点积协议，并进行了安全性和计算/通讯开销的分析，实验验证了协议的有效性。由于采用了数据置换和分段技术，各方的安全性都得到了提高。另外，借助半诚实的第三方的参与，提出了ESSP协议和EPPSP协议，分析表明，ESSP协议可以使各方获得一致的安全性，而EPPSP协议则可以提高协议的性能。关联规则和分类挖掘的关键步骤分别是寻找全局频繁项集和为每个节点找到最佳属性，在应用安全的点积协议于关联规则和分类挖掘时，给出了能有效地计算项集的支持度、属性的熵和信息增益的方法，它们都采用了安全的点积计算，不仅可以保持单个数据记录不为它方所知，还可以保护所发现的模式中的敏感信息，在不泄漏各自的事务项的情况下，获得正确的关联规则和分类树。为了协调安全性和性能之间的矛盾，提出了一种现实的多方安全计算模式。该模式在保障挖掘结果正确性的前提下，既能提供“可以接受”的数据隐私，又有“可以接受”的计算/通讯开销。异常检测被用来发现数据集中显著不同于其它数据的对象，为了保持隐私，基于现实的多方安全计算模式，提出了一种结合多方安全计算和数据干扰技术的方案。其中，将超出阈值距离的成对的点的序号进行通讯，同时，随机选取一定数量的正常范围的成对的点分散在上述集合中，并运用安全和方法来挖掘全局异常点；既隐藏了真实的信息，又提高了算法的效率。
英文文摘	One of the most important innovations in the twentieth century is the technology which can handle great deal of digital information. On the one hand, computer and network impact the society and human’s normal life greatly. On the other hand, their development has taken people’s right of privacy into an awkward situation. The traditional right of privacy is an abstract concept. However, with the ubiquity of computers and network, privacy concerns have posed technical challenges fundamentally different from those that occurred before the information era. It refers to the right of users to conceal their personal information and have some degree of control over the use of any personal information disclosed to others. Same as other information, privacy information presents “0” and/or “1” as single data and aggregate data in the computers and network. Data mining technology is to extract interesting and comprehensive knowledge for users from large quantities of data. However, data is often separated in several different sites. Traditionally，it is necessary to integrate the distributed data into a data warehouse or data market to finish the mining process. Privacy, legal and commercial concerns restrict centralizing these data. Thus, it has inaugurated a new area which combines privacy preserving and data mining. It concentrates on developing accurate models without sharing precise individual data records. The rights of privacy in the traditional and informational era are introduced respectively. Also the concept of privacy within data mining is given. Besides, an overview of privacy preserving data mining is addressed in the following aspect: data distribution, the state of the art in privacy preserving, tools used in privacy preserving data mining, etc. Secure Multiparty Computation is then used as the theoretical framework for our research. Scalar product is widely used in data mining. After existing private scalar product protocols are analyzed, several private scalar product protocols based on two-party and three-party computation are proposed. Both the security and communication/computation analysis are given. The above secure scalar product protocols are used in association rule mining and decision tree building. The basic step in mining the association rule is finding the frequent itemsets for the whole database. The key step in building the decision tree is finding the best property column for each node. Scalar product is applied in both kind of data mining for private concerns. Accordingly, not only single data items but also the sensitive information in the data model discovered can be protected. A Secure Multi-party Computation model which can provide correct result and acceptable privacy is proposed. The computation/communication cost of the model is also acceptable. Privacy preserving outlier detection based on secure multi-party computation and data perturbation is addressed. The idea of distance-based outliers for vertically partitioned data is introduced. Based on a practical secure multi-party computation model, an efficient algorithm for privacy preserving outlier detection with data perturbation techniques is proposed. It is not necessary to communicate all the distances of pairwise points, but the ID numbers of pairwise points whose distance exceed the threshold. Some pairwise points whose distance is within the threshold are chosen into the above points to hide private information. The algorithm maintains integrity and good privacy of the data sets of each party while keeping communication and computation cost low.