Intelligent and Distributed Computing Laboratory
  Research on Methodologies of Web Information Retrieval and Hypertext Classification
Name: 叶卫国
Defense date: 2003.10.16
Submission date: 2003.10.16
Degree: Doctoral
Chinese title: Web信息检索及网页分类方法的研究
English title: Research on Methodologies of Web Information Retrieval and Hypertext Classification
Advisor 1: 卢正鼎
Advisor 2:
Chinese keywords: Web knowledge discovery; Web page search; similarity computation; information retrieval; vector space model; data fusion; classification
English keywords: Web mining; Web search; similarity scoring; information retrieval; vector space model; data fusion; classification
Chinese abstract: With the explosive growth of the Internet, the WWW has developed into a huge, dynamic information-service network, spanning sites across the globe and containing many kinds of information resources, that provides users with an extremely valuable information source. The Web is massive, heterogeneous, and dynamic, which easily causes "information overload"; at the same time, these massive Web resources contain knowledge of great potential value. Users urgently need tools that can quickly and effectively discover resources and knowledge on the Web, so as to improve the efficiency of retrieving and using Web information. In practice, the amount of information an end user can digest per unit of time is roughly constant, so how to acquire knowledge automatically and effectively from the enormous volume of information on the Internet is a problem that demands a solution.
English abstract: "Turning the Web into the world's largest knowledge base" has been proposed as one of the most important challenges in artificial intelligence today. Most of the Web's abundant information resources can be regarded as semi-structured data sources: sources containing data that is fielded but lacks a global schema. Documents in domains such as mechanical manufacturing and medical information fall into this category. These resources, however, are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), and although the markup of a data source gives some hints about its record and field structure, that structure is often obscured both by presentation-oriented formatting intended for human viewing and by the wide variation in formats from site to site. In practice, the amount of information a user can absorb is roughly proportional to the time spent, so it is essential to study effective methods of Web knowledge discovery (Web mining). The Web is huge, diverse, and dynamic, which raises problems of information overload, scalability, and multimedia data. Information retrieval (IR) on the Web is the automatic retrieval of all relevant documents, i.e., resource finding of the intended Web documents, while at the same time retrieving as few non-relevant documents as possible. Web IR has become a very active field: it applies traditional text IR methods to the Internet and also exploits the properties of the Web graph and of social networks, as Google does. This research focuses on how to retrieve relevant Web pages and contents effectively and broadly, filter Web pages, and assign proper labels to them. Accurately finding user-specific information on the Web is very difficult, and traditional Web search engines simply take a query as input and produce a set of (hopefully) relevant pages that match the query terms.
While useful in many circumstances, search engines have the disadvantage that users must formulate queries that specify their information need, a process that is prone to errors. Search engines also have two other significant shortcomings: low precision and low recall. Building on a discussion of PageRank, HITS, and similarity between Web texts, new algorithms called SG-HITS (Similarity Graph HITS) for finding relevant documents on the Web are introduced. These methods use not only the hyperlinks of the Web graph but also similarity scores computed from the term weights of the document representations. When the algorithms were used to find Chinese medical information on the Internet, experiments showed better precision than traditional IR methods and the basic HITS algorithm. Since no single search engine surpasses all others under all circumstances, and the "best" system for a particular task may not be known a priori, meta-search is an effective way to find relevant documents in the vast information source that is the WWW. Three data-fusion methods for meta-search are presented: Similarity Linear Combination, Unbiased-Bayes, and Biased-Bayes. Biased-Bayes uses the ODP directory for prior calculation and requires little training. Compared with other fusion methods, these methods improve average precision clearly and consistently; their effectiveness is comparable to that of approaches that analyze the Web documents themselves. A new data-fusion method based on a semi-complete graph is then proposed, which can be implemented efficiently with heapsort. The impact of Dempster-Shafer theory in artificial intelligence has grown steadily; the theory clarifies the boundary between imprecision and ignorance, and computing the basic probability assignment (bpa) and the combined belief function corresponds naturally to data fusion. Based on DS theory, two fusion models are proposed: fusion on rank position and fusion on Web page title.
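The abstract does not give the exact SG-HITS update rule. As an illustration only, here is a minimal sketch, under the assumption that SG-HITS runs standard HITS iterations with each hyperlink weighted by the cosine similarity of the two pages' term-weight vectors; all function names and the toy data are hypothetical, not taken from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sg_hits(links, vectors, iters=50):
    """Similarity-weighted HITS (illustrative sketch, not the thesis's code).

    links:   dict page -> list of pages it links to
    vectors: dict page -> sparse term-weight dict (e.g., tf-idf)
    Returns (authority, hub) score dicts, L2-normalised.
    """
    pages = set(links) | {q for qs in links.values() for q in qs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    # Each edge carries the text similarity of its endpoints.
    w = {(p, q): cosine(vectors.get(p, {}), vectors.get(q, {}))
         for p, qs in links.items() for q in qs}
    for _ in range(iters):
        # authority(q): sum of similarity-weighted hub scores of in-links
        auth = {q: sum(w[(p, r)] * hub[p]
                       for p, qs in links.items() for r in qs if r == q)
                for q in pages}
        # hub(p): sum of similarity-weighted authority scores of out-links
        hub = {p: sum(w[(p, q)] * auth[q] for q in links.get(p, []))
               for p in pages}
        # L2-normalise to keep the iteration bounded
        for d in (auth, hub):
            n = math.sqrt(sum(x * x for x in d.values())) or 1.0
            for k in d:
                d[k] /= n
    return auth, hub
```

With this weighting, a page pointed to by many textually similar hubs accumulates authority, while links between textually unrelated pages contribute little, which matches the abstract's claim of combining the Web graph with term-weight similarity scoring.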
In addition, a formal model for data fusion is put forward that represents documents and rankings in a proposition and sentence space and re-ranks the merged list using knowledge about the documents. Existing mature text IR techniques are inadequate for navigating this unprecedented abundance of semi-structured information on the Internet, and traditional text-categorization algorithms such as Naïve Bayes (NB) and SVM are not sufficient for Web classification. After a discussion of the Web categorization process, a new GR measure for feature selection and term weighting is proposed, followed by a novel hyperlink-based classifier. The classifier exploits the characteristics of the Web graph and computes labels by matrix iteration. Experimental comparisons showed that these algorithms are more appropriate for Web categorization than traditional IR methods. A new Web classification method based on hyperlink clustering is also proposed; hyperlink clustering is more appropriate for huge Web page collections than traditional text clustering. Finally, a preliminary prototype designed and implemented on the above methods and technologies, named Web-MIND (Web-Monitoring INformation of Drug), is introduced. Web-MIND can effectively search for medical information on the Web, filter non-relevant content, monitor Web sites that carry medicine information and medical advertisements, and audit the behavior of Web medical advertising.
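The abstract says only that the hyperlink-based classifier "computes labels by matrix iteration" over the Web graph, without giving the iteration. One common realization of that idea is iterative label propagation, in which scores spread from a few labelled pages to their link neighbors until they stabilize. The sketch below illustrates that general technique; the function, its parameters, and the clamping/averaging scheme are assumptions for illustration, not the thesis's actual algorithm.

```python
def propagate_labels(adj, seed, classes, iters=20, alpha=0.5):
    """Iterative label propagation over a hyperlink graph (illustrative).

    adj:     dict page -> list of linked pages (treated as undirected)
    seed:    dict page -> known class label for the labelled pages
    classes: list of possible class labels
    alpha:   weight kept on a page's own score vs. its neighbours' average
    Returns dict page -> predicted class.
    """
    # Symmetrise the link graph: a link in either direction connects pages.
    nbrs = {p: set(qs) for p, qs in adj.items()}
    for p, qs in adj.items():
        for q in qs:
            nbrs.setdefault(q, set()).add(p)
    pages = set(nbrs)
    # score[p][c]: current membership of page p in class c
    score = {p: {c: (1.0 if seed.get(p) == c else 0.0) for c in classes}
             for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            if p in seed:                  # clamp pages with known labels
                new[p] = score[p]
                continue
            acc = {c: 0.0 for c in classes}
            for q in nbrs[p]:
                for c in classes:
                    acc[c] += score[q][c]
            n = max(len(nbrs[p]), 1)
            # Mix own score with the neighbourhood average.
            new[p] = {c: alpha * score[p][c] + (1 - alpha) * acc[c] / n
                      for c in classes}
        score = new
    return {p: max(classes, key=lambda c: score[p][c]) for p in pages}
```

In matrix terms this is a repeated multiplication of a class-score matrix by a normalized adjacency matrix, which is consistent with, though not necessarily identical to, the "matrix iteration" the abstract mentions.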