Intelligent and Distributed Computing Laboratory
  Research on the Implementation and Optimization of Search Engine
Name 文坤梅
Thesis defense date 2003.05.09
Thesis submission date 2005.08.06
Thesis level Master's
Chinese title 搜索引擎的实现研究及相关优化
English title Research on the Implementation and Optimization of Search Engine
Advisor 1 卢正鼎
Advisor 2
Chinese keywords 搜索引擎;更新度;元搜索引擎;结果优化排序;信息过滤;相关性
English keywords search engine;freshness;meta-search engine;optimal rank of the results;information filtering;relevancy
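Abstract (translated from Chinese) A search engine depends on several key modules working together: crawling, local web page storage, indexing, ranking of search results, and link analysis applied to boost search performance. This thesis studies the architecture and working principles of search engines and describes the design and implementation techniques of each component. Page updating is a key technique affecting search engine quality, and the design of the update algorithm largely determines page freshness. To improve freshness, an optimized page-crawling strategy called classified updating is proposed. It estimates each page's change frequency from its change history, uses that estimate as the criterion for classifying pages, and then derives an update rate for each set of pages with an averaging algorithm, so that pages are refreshed while system resources are allocated in a balanced way. Analysis indicates that on the real Web it performs better than the existing uniform and individual update strategies. A meta-search engine provides an integrated environment over multiple search engines and offers wider coverage, better scalability, and more relevant results than a traditional engine; ranking the results returned by the component systems is the core technique for improving its effectiveness. Building on a thorough understanding of the concept of relevance, a weight-based method for optimized result ranking is proposed. It takes user needs into account through an interest weight, a people-count weight, and a position weight, and adopts a fixed-capacity page retrieval mode. A small prototype meta-search engine was implemented; experiments verify that it performs well and produces an optimized ranking of the results.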
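The classified updating strategy summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the estimator, the class boundaries, or the exact averaging rule, so everything concrete here is an assumption made for illustration: the change rate is taken as detected changes per day of observation, the thresholds `0.05` and `0.5` and the class names are hypothetical, and the per-class update rate is simply the mean of the estimated rates in that class.

```python
from statistics import mean


def estimate_change_rate(history, visit_interval_days=1.0):
    """Estimate changes per day from a page's change history.

    history is a list of booleans, one per crawl visit, True when the page
    was found changed. This simple ratio is an assumed estimator.
    """
    if not history:
        return 0.0
    return sum(history) / (len(history) * visit_interval_days)


def classify(rate, thresholds=(0.05, 0.5)):
    """Assign a page to a class by its estimated change rate.

    The thresholds and the three class names are hypothetical.
    """
    low, high = thresholds
    if rate < low:
        return "slow"
    if rate < high:
        return "medium"
    return "fast"


def class_update_rates(pages, visit_interval_days=1.0):
    """Group pages by class and average the estimated rates in each class.

    pages maps URL -> change history. Returns class -> (mean rate, URLs),
    where the mean rate stands in for the update rate of that page set.
    """
    groups = {}
    for url, history in pages.items():
        rate = estimate_change_rate(history, visit_interval_days)
        groups.setdefault(classify(rate), []).append((url, rate))
    return {
        label: (mean([r for _, r in members]), [u for u, _ in members])
        for label, members in groups.items()
    }
```

Under these assumptions a crawler would revisit every page in a class at that class's mean rate, which sits between a uniform policy (one rate for all pages) and an individual policy (one rate per page).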
English abstract A search engine relies on the cooperation of several modules: crawling, local web page storage, indexing, ranking of search results, and link analysis for boosting search performance. The architecture and working principles of search engines are studied, and the common design and implementation techniques for each of these components are introduced. Page freshness is key to search engine quality, and the design of the update algorithm strongly influences it. To improve page freshness, an improved page-crawling strategy, called the classified freshness strategy, is proposed. The method estimates the change frequency of web pages from their change history and classifies the pages according to that estimate. The refresh rate of the pages in each class is then calculated with an averaging algorithm, so that system resources are allocated in a balanced way. Analysis shows that on the real Web this method performs better than both the uniform freshness strategy and the individual freshness strategy. A meta-search engine provides an integrated environment over multiple search engines, so it has advantages over a traditional search engine: wider coverage, better scalability, and more relevant results. Ranking the results returned by the component search engines is the most important technique for improving the effectiveness of a meta-search engine. Based on a thorough understanding of relevance, a new weight-based ranking method is proposed. It takes the user's needs into account through an interest weight, a people-count weight, and a position weight, and adopts a fixed-capacity page retrieval mode. A small meta-search engine prototype is implemented. Experiments show that the method performs well in practice and that it optimizes the ranking of the results.
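The weight-based result ranking mentioned in the abstract can be sketched in Python as follows. The abstract names three weights (interest, people count, position) and a fixed-capacity retrieval mode but gives no formulas, so this sketch rests on assumptions: the interest weight is a per-engine user preference in [0, 1], the people-count weight is a normalized visit count per URL, the position weight decreases linearly with rank, and the three are combined linearly with hypothetical coefficients. The function name `merge_results` and all parameter names are illustrative only.

```python
def merge_results(engine_results, interest, visits, capacity=10,
                  coeffs=(0.4, 0.3, 0.3)):
    """Merge ranked lists from several engines into one ranked list.

    engine_results: engine name -> ranked list of URLs (best first).
    interest: engine name -> interest weight in [0, 1] (assumed meaning).
    visits: URL -> visit count used as the people-count signal (assumed).
    capacity: fixed number of results taken from each engine.
    coeffs: hypothetical coefficients for (interest, people, position).
    """
    a, b, c = coeffs
    max_visits = max(visits.values(), default=0) or 1
    scores = {}
    for engine, urls in engine_results.items():
        # Fixed-capacity retrieval: keep only the top `capacity` results.
        for rank, url in enumerate(urls[:capacity]):
            position_w = (capacity - rank) / capacity
            interest_w = interest.get(engine, 0.0)
            people_w = visits.get(url, 0) / max_visits
            score = a * interest_w + b * people_w + c * position_w
            # A URL returned by several engines keeps its best score.
            scores[url] = max(scores.get(url, 0.0), score)
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))


if __name__ == "__main__":
    ranked = merge_results(
        {"e1": ["x", "y"], "e2": ["y", "z"]},
        interest={"e1": 1.0, "e2": 0.5},
        visits={"x": 10, "y": 5, "z": 0},
        capacity=2,
    )
    print(ranked)
```

Keeping the best score for a URL that several engines return is one possible design choice; summing the scores instead would reward agreement between engines.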