智能与分布计算实验室
  粗糙集在Web挖掘中的应用研究
姓名 易高翔
论文答辩日期 2006.05.09
论文提交日期 2006.05.15
论文级别 博士
中文题名 粗糙集在Web挖掘中的应用研究
英文题名 Research on Rough Set for Application in Web Mining
导师1 卢正鼎
导师2 胡和平
中文关键词 Web挖掘;信息检索;分类;聚类;粗糙集;扩展粗糙集;容错粗糙集
英文关键词 Web mining;information search;classification;clustering,rough set;extended rough set;tolerance rough set
中文文摘 Web挖掘广义定义为从Internet上发现和分析有用信息。Web挖掘可以协助Web搜索引擎找出高质量的网页和分析Web语义结构、点击信息等,使Web服务更加智能化。目前Web挖掘技术中,特别是Web文本的分类、聚类,采用的核心算法是基于词频统计的矢量空间模型算法。该算法中文档的特征词的选取和相似度测量是关键。对特征词的选取和权重计算有很多研究,取得了积极效果。但是,特征词之间的关系研究很少。如何准确描述和恰当利用特征词之间的联系,是改进目前Web挖掘算法的一个途径。粗糙集理论是一种强有力的处理不确定性关系的数学工具,粗糙集扩展更能满足实际应用的需要。 从知识分类的观点剖析了粗糙集理论的内涵,指出了粗糙集扩展的必要性。以Web信息检索为研究对象,以扩展粗糙集理论为工具,以知识获取为目的,提出了基于模糊粗糙集的网页个人兴趣分级算法,较系统深入地研究了基于容错粗糙集的Web查询词的扩展、分类、聚类理论与应用。 在经典粗糙集合的基础上,针对数据的过拟合而使其对新对象的预测能力降低;对原始数据本身的模糊性缺乏相应的处理方法;针对粗糙集边界区域的刻画比较简单,而没有一定程度的属于或包含等,探讨了几种典型的扩展模型,如变精度粗糙集模型、模糊粗糙集模型和容错粗糙集模型。分析了这几种模型的相关性质,指出了它们实质上可以统一到广义粗糙集的模型上,只是针对的关系基础和定义的隶属函数不同。从而能更加直观地理解粗糙集理论,启发应用粗糙集理论开发更好的数据挖掘算法。 分析了Web检索中查询准确率不高的一个重要原因是用户对查询语句的不能精确表示,提出了基于容错粗糙集的查询词自动扩展方法,用特征词容错类描述查询语句与返回结果之间的不确定关系,用查询语句上近似集合构造新的查询语句,自动增加了带权重的相关查询词,并在标准数据集上进行了实验,结果表明该方法,能有效地进行查询词扩展,提高了检索性能。 为解决网页分级HITS和PageRank算法中共同的缺陷主题“漂移”问题,结合用户的历史查询词,采用模糊粗糙集的理论来描述个人兴趣与文档之间的不确定关系,在比较个人兴趣和网页相似度中,采用了上近似集相似与下近似相似结合的方法,实现了一种基于模糊粗糙集的个人兴趣网页分级算法。实验结果说明基于兴趣的PageRank方法是可行的。 分析总结了粗糙集理论的Web分类一般方法,指出大多数方法都是把预先定义的类别看成是互斥的概念,很少考虑类与类之间有相联系的概念。利用Web文档特征词同时出现的价值,用容错粗糙集描述这种联系,给出了基于容错粗糙集的Web文档分类方法,该方法抓住了类与类之间有一定交叉概念这个关键,用特征词近似相似来精确判断文档类别,提高了Web分类效果。 探讨了几种聚类策略,阐述了聚类的本质就是类内样本点“抱团”,给出了基于容错粗糙集的Web搜索结果的聚类方法,实现了聚类标记算法,对比实验表明,该方法优越于普通K均值算法。
英文文摘 The Web mining is generally defined as discovering potential and available patterns or knowledge on Internet by means of data mining technique. With the help of Web mining, the engine may find high quality page and make Web server intelligent by analyzing semantic structure and click information. The present Web mining technology, especially the core algorithmic of Web document classification and clustering are based on statistical word frequency Vector Space Model (VSM). The key of the algorithms is the strategy of terms selection and the measurement of similarity. In order to improve the results, many researchers have pursued for these two techniques, such as adopting different terms weight and similarity formulas. But the relationship between the terms is rarely studied. How to describe the relationship exactly and what to make use of the association is a new way to improve the conventional Web mining algorithm. The rough set is a powerful mathematics tool for dealing with uncertain relationship and the extended rough set is more satisfied for practical application. The center viewpoint of rough set was dissected on the base of knowledge category. It is necessary to extend the rough theory for more application. the research the Web application based on extended rough set theory was carried out. With target of Web information search and by the tool of extended rough set theory, aiming at knowledge discovery, the individual interest Web Rank algorithm on fuzzy rough set was carried through. The method of query term expansion and Web pages classification and clustering by means of tolerance rough set were also systematically developed. Upon classical rough set, the ability of new object forecast is low because of data simulation being over normal degree. It is helpless for rough set to dispose fuzzy data directly. And the description of boundary is simple. For example, there is no relation of part containing or belonging to in rough set. Hereby, several extended rough modes upon classical rough conception were discussed, such as variable precision rough set and fuzzy rough set and tolerance rough set. Some relational properties between these modes were analyzed. It is necessary to point out that these modes could unify to the generalized rough set in nature just only difference of relation or membership function. This is useful of intuitively explaining rough theory and enlightening somebody to implement better data mining algorithm. One of important reasons caused low precision was presented, which was inaccurate express of the query. So a new method of automatic query expansion based on tolerance rough was put forward. In the algorithm, the uncertain connection between query terms and retrial documents was describe as term tolerance class. The upper approximation set of query sentence was looked as query expansion. The new additional terms were also given weight numbers. The results of experiment on standard data collection showed that the approach was effective on query expansion and high search precision was gained. In order to overcome the “topic draft” in HITS and PageRank, the new personal interest page rank algorithm based on fuzzy rough set was discussed. In this algorithm, history query terms were denoted user’s interests and the connection between user interests and documents was described by means of fuzzy rough set. The combination of upper approximation set and lower approximation set were made use of measuring the similarity between interest and document. The experiments results showed that the method was feasible. Some Web document taxonomies about rough set were summarized. In most methods, the class was looked as exclusive object and the connection between classes was little in use. A Web document classification based on tolerance rough set was presented. In the algorithm, the key of cross conception among categories was described using terms tolerance rough set and the capability of Web classification was improved highly. Several clustering strategies were discussed and the essence of clustering was expounded as “to hold together”. A new algorithm of search results clustering based on tolerance rough set was given. The category label algorithm was implemented. The contrast experiment demonstrated that the proposal method exceed general K-mean clustering algorithm.