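Intelligent and Distributed Computing Laboratory
  Research on Ranking Algorithm Based on Topic Similarity of Web Pages in Search Engines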
Name Liang Kun (梁昆)
Thesis defense date 2008.06.05
Thesis submission date 2008.06.10
Thesis level Master's
Chinese title 基于网页主题相关度的搜索引擎排序算法研究
English title Research on Ranking Algorithm Based on Topic Similarity of Web Pages in Search Engines
Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2 Li Ruixuan (李瑞轩)
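Chinese keywords 搜索引擎 (search engine), 排序算法 (ranking algorithm), 网页等级 (PageRank), 锚文本 (anchor text), 链接分析 (link analysis), 主题相关度 (topic relevance)
English keywords search engine, ranking algorithm, PageRank, anchor text, similarity, link analysis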
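Chinese abstract As the volume of information on the Internet keeps growing rapidly, searching this vast network for the information one needs has become a routine task for most users. Because web resources lack uniform standards, many pages are poorly structured and organized, and their content spans a wide range of fields, the results returned by search engines often fail to meet users' needs. Building on a study of the development of search engines in China and abroad, this thesis analyzes content-based and link-based ranking algorithms in depth, and surveys and summarizes existing improvements to link-structure-based algorithms. To rank pages that match the user's query nearer the top of the results, achieve higher precision, and fit users' browsing habits, the thesis addresses the characteristics and shortcomings of the link-structure-based PageRank algorithm with an improved PageRank algorithm based on web page topic relevance. The improved algorithm introduces page relevance information to change how PageRank values are propagated between pages, thereby improving accuracy. By analyzing page content, it extracts all links in each page together with their anchor text and builds a page link repository; it then uses the vector space model (VSM) to compute the relevance between each link's anchor text and the page content, and on that basis computes the improved PageRank offline, thereby improving user satisfaction with search. Finally, experiments show that the improved PageRank algorithm helps users find the pages they need conveniently, and that introducing topic relevance analysis raises the precision of the returned results and further improves user satisfaction. Directions for further research and possible open problems are also given.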
English abstract The World Wide Web holds many kinds of resources, and their number grows rapidly every day. Searching for useful information has become a habit for Internet users. Search engines make such searching convenient, so they have become an increasingly important tool for surfing the web. However, the results returned by search engines often fail to meet our needs, because Internet resources lack a common standard and many web pages are poorly structured and organized. The broad range of fields that web pages cover also leads search engines to return much information with little relevance to the query. What we hope is that the information we need appears at the front of the search results, so that we can find it easily. This thesis reviews the background of search engine development at home and abroad and analyzes content-based and link-based ranking algorithms in depth. It then compares existing search engine ranking algorithms and summarizes approaches to improving them. The aim is an efficient ranking algorithm that gives users high-precision search results. Given the characteristics and shortcomings of existing algorithms, the thesis proposes an improvement to PageRank based on anchor text and page relevance, which refines the algorithm in two respects: web page similarity and link analysis. The content of each web page is analyzed to extract all of its links and their corresponding anchor text, and a page link library is built. The vector space model (VSM) is used to calculate the similarity between the anchor text and the relevant page; on this basis the improved PageRank algorithm is computed offline, and the improved algorithm is then compared with the original.
Finally, the experimental results show that the improved ranking algorithm can guide users to useful information easily and efficiently. Moreover, by analyzing the topic similarity of web pages, the improved algorithm achieves better query precision. The thesis closes with directions for further research and some potential problems.
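
The approach described in the abstract (VSM similarity between anchor text and page content, used to change how PageRank is passed along links) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual formulation: the record does not give the weighting formula, so the sketch assumes that each out-link receives a share of its source page's rank proportional to the cosine similarity between the link's anchor text and the target page's content, with raw term counts as vector components. The names `cosine_similarity` and `topic_weighted_pagerank`, the damping factor 0.85, and the iteration count are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts under a bag-of-words vector space model."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_weighted_pagerank(pages, links, damping=0.85, iterations=50):
    """pages: {page_id: content}; links: [(source, target, anchor_text)].

    Assumption: a link's transfer weight is the similarity between its
    anchor text and the target page's content, normalized per source page.
    """
    n = len(pages)
    weights = {}
    for source, target, anchor in links:
        sim = cosine_similarity(anchor, pages[target])
        out = weights.setdefault(source, {})
        out[target] = out.get(target, 0.0) + sim
    # Normalize each source's weights; split uniformly if all are zero.
    for out in weights.values():
        total = sum(out.values())
        for target in out:
            out[target] = out[target] / total if total > 0 else 1.0 / len(out)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if p not in weights)
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for source, out in weights.items():
            for target, weight in out.items():
                new_rank[target] += damping * rank[source] * weight
        rank = new_rank
    return rank


if __name__ == "__main__":
    demo_pages = {
        "a": "index of python tutorial links",
        "b": "python programming tutorial",
        "c": "cooking recipes",
    }
    demo_links = [("a", "b", "python tutorial"), ("a", "c", "python tutorial")]
    print(topic_weighted_pagerank(demo_pages, demo_links))
```

With uniform weights this reduces to the standard PageRank iteration; the only change in the sketch is that a link whose anchor text matches its target page poorly passes on less rank. A real system would also need tokenization suited to the page language and a term weighting such as TF-IDF, which the sketch leaves out.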