Intelligent and Distributed Computing Laboratory
  Research on Near-Duplicates Detection Algorithm of Search Engine
Name: Zheng Peng (郑鹏)
Thesis defense date: 2008.06.05
Thesis submission date: 2008.06.10
Degree level: Master's
Chinese title: 搜索引擎中的相似网页探测算法研究
English title: Research on Near-Duplicates Detection Algorithm of Search Engine
Advisor 1: Lu Zhengding (卢正鼎)
Advisor 2:
Chinese keywords: 搜索引擎;相似网页;相似度;指纹
English keywords: Search Engine; Near-Duplicates; Similarity; Fingerprint
Chinese abstract: Near-duplicate web pages exist in large numbers on the Internet and cause several problems for search engines: crawlers repeatedly fetch pages with identical content, placing a heavy burden both on the crawler itself and on the Internet; this in turn leads to duplicate index entries and extra storage consumption, and degrades search-engine performance and user experience. An effective near-duplicate detection algorithm that removes most duplicate pages can therefore greatly reduce this burden and improve both performance and user experience. This thesis reviews the development of search engines, analyzes how they work, and surveys the current state of near-duplicate detection algorithms. After examining the strengths and weaknesses of existing algorithms, it summarizes two basic conditions that an accurate and effective near-duplicate detection algorithm should satisfy. Taking the classic Simhash fingerprinting and Shingling algorithms as a starting point, and using these two conditions as criteria, several improvements are proposed to address their shortcomings: since Simhash lacks word-position information, position is incorporated into the word weights; since Shingling lacks term-frequency features, the fingerprints of all shingles are superimposed. To further improve performance, the two algorithms are integrated according to their respective characteristics, additional page-content features are extracted, and part of speech, term frequency, and word position are incorporated into the shingle weights. In addition, because the number of shingles is very large, a method is given for generating a shingle's fingerprint by superimposing the fingerprints of its words. The improvements focus on extracting more complete page-content features so as to increase the accuracy of page-similarity computation. Based on these improvements and on the fingerprint detection algorithm of Manku et al., a prototype system was built, and experiments verified the effectiveness of the improvements; the prototype was then refactored and integrated into a search-engine crawler system, achieving effective online near-duplicate detection.
English abstract: A great number of duplicate or near-duplicate web pages exist on the Internet, and they cause many problems: they place a heavy burden on both crawlers and the Internet, lead to duplicate indices and extra storage space, and significantly degrade the performance of search engines. An effective duplicate-detection algorithm that discovers and removes most duplicates can reduce the burden on search-engine crawlers and lead to better performance. This thesis reviews the development of search engines, analyzes their working principles, and then surveys existing duplicate-detection algorithms. On the basis of the advantages and disadvantages of current algorithms, it concludes that an effective duplicate-detection technique must satisfy two necessary conditions. Simhash fingerprinting and Shingling are two classic duplicate-detection algorithms; after analyzing them in detail, several improved solutions are proposed in light of the two necessary conditions. To capture more web-page features, word position is taken into account in the word weights, the fingerprints of all shingles are superimposed to generate the final fingerprint of a web page, and the two algorithms are integrated, based on their characteristics, to further improve performance. Because of the huge number of shingles, a new method is proposed that generates the fingerprint of a shingle from the fingerprints of the words it contains. The proposed improvements are based on web-page feature selection, with the goal of improving the accuracy of the similarity computation between two near-duplicate pages. A prototype was then developed on the basis of Manku's algorithm to detect duplicate web pages efficiently and to confirm the effectiveness and efficiency of the improved solutions. Finally, a crawler system that includes the prototype as a subsystem was developed to detect duplicate web pages online.
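Both abstracts describe folding word-position information into the term weights used by Simhash and comparing 64-bit fingerprints by Hamming distance as in Manku et al. The following Python sketch is only an illustration of that idea, not the thesis's actual implementation: the MD5-based word hash, the particular weighting formula (frequency plus a bonus for earlier positions), and all function names are assumptions introduced here.

import hashlib
from collections import defaultdict

def word_fingerprint(word, bits=64):
    # Hash a word to a fixed-width integer fingerprint (MD5-based here; any
    # uniform hash of the right width would do).
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) & ((1 << bits) - 1)

def position_weighted_simhash(words, bits=64):
    # Simhash in which each term's weight combines its frequency with a
    # position bonus: terms appearing earlier in the page weigh more.
    # The exact formula is an illustrative assumption.
    weights = defaultdict(float)
    n = len(words)
    for pos, word in enumerate(words):
        weights[word] += 1.0 + (n - pos) / n
    v = [0.0] * bits
    for word, weight in weights.items():
        fp = word_fingerprint(word, bits)
        for i in range(bits):
            v[i] += weight if (fp >> i) & 1 else -weight
    # Each bit of the final fingerprint is the sign of the accumulated vote.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

Two pages would then be flagged as near-duplicates when hamming_distance(fp1, fp2) is at most a small threshold k (Manku et al. use k = 3 for 64-bit fingerprints).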
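The abstracts also mention two Shingling-related changes: deriving a shingle's fingerprint from the fingerprints of the words it contains (to cope with the huge number of shingles), and superimposing the fingerprints of all shingles so that shingle frequency influences the final page fingerprint. The sketch below shows one plausible reading of those two steps under stated assumptions; the shingle width k = 4, the MD5-based word hash, and the function names are illustrative, not taken from the thesis.

import hashlib

def _word_fp(word, bits=64):
    # 64-bit word fingerprint, same idea as in the previous sketch.
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)

def shingle_fingerprint(shingle_words, bits=64):
    # Fingerprint of a k-word shingle obtained by superimposing (bitwise
    # voting over) its words' fingerprints, instead of hashing the joined
    # k-gram string for every shingle separately.
    v = [0] * bits
    for word in shingle_words:
        fp = _word_fp(word, bits)
        for i in range(bits):
            v[i] += 1 if (fp >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def page_fingerprint(words, k=4, bits=64):
    # Superimpose the fingerprints of all k-word shingles of a page into a
    # single fingerprint, so repeated shingles (frequency) affect the result.
    v = [0] * bits
    for start in range(len(words) - k + 1):
        fp = shingle_fingerprint(words[start:start + k], bits)
        for i in range(bits):
            v[i] += 1 if (fp >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

Page fingerprints produced this way can be compared with the same Hamming-distance test shown in the previous sketch.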