Intelligent and Distributed Computing Laboratory
  The Method research of Web Information Extraction Based on Fuzzy Theory
Name Zhang Maoyuan (张茂元)
Thesis defense date 2005.05.08
Thesis submission date 2005.05.13
Thesis level Doctoral
Chinese title 基于模糊论的Web信息提取方法研究
English title The Method research of Web Information Extraction Based on Fuzzy Theory
Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords 模糊;信息提取;汉语分词;主动数据库;分类;匹配;学习
English keywords fuzzy; information extraction; Chinese word segmentation; active database; classification; matching; learning
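Chinese abstract (translated) The rapid growth of the Internet has made the information resources on the network extraordinarily rich. This mass of Web information holds knowledge of great potential value, but it also raises problems: the information is too abundant to digest, Chinese words can be segmented ambiguously, Web information comes in inconsistent forms, and the information is hard to identify. Obtaining valuable Web information quickly and accurately, and discovering knowledge in such massive data, calls for an efficient and highly accurate Web information extraction tool.
Existing Web information extraction methods mainly address the inconsistent forms of Web information, and many good methods have been proposed. These methods, however, rest on experiments and still lack further theoretical analysis. To address the problems above, the author combines Web page classification, Chinese word segmentation, and fuzzy matching to study Web information extraction, which is meaningful for research on Web information extraction methods.
Web information extraction is a complex system, so the author follows the basic ideas of fuzzy theory and gives a Web information extraction method based on fuzzy theory. The method has five parts: fuzzy Web page classification, Web page information extraction based on feature learning, fuzzy matching, context-based Chinese word segmentation, and a distributed active database.
To address the problem that Web information is too abundant to digest, a Web page classification method based on feature selection and fuzzy learning is proposed. Its feature selection method, based on weighted similarity, follows fuzzy theory to handle the very high dimensionality and to speed up classification. It gives a way to compute the weight wf, proves that this computation makes the weighted similarity consistent with the similarity based on document relations, and analyzes the speed-up of the feature selection algorithm. The fuzzy learning method gives a mechanism that uses membership functions to incorporate human knowledge, together with a parameter learning rule for learning the membership function parameters. Through theoretical derivation, a Lyapunov function is used to analyze the convergence of the parameter learning rule, revealing the underlying factor that makes the parameter learning algorithm adjust parameters toward the minimum error. Building on the convergence analysis of the single-parameter learning algorithm, a single-parameter learning algorithm with a variable adjustment rule is given to speed up parameter learning.
To address the inconsistent forms of Web information, a theory of Web page feature information extraction based on correlation filtering is proposed. The theory covers the mathematical representation of Web page feature information, a Web page information filtering theorem in the one-dimensional spatial domain with its proof, and a similarity analysis of Web page information. It also points out that, for pages of the same Web site that share a common information template, a feature information extraction system can implement a matched filter by correlation reception and so extract the feature information of the template. On this theoretical basis, a Web page information extraction method based on feature learning is given that fuses the markup-rule-based and content-based approaches. Following fuzzy theory, the method studies how to learn information features in order to improve the adaptability of information extraction.
To address the difficulty of identifying Web information, a sememe-based semantic matching method for Web page information items is proposed. The method gives an improved sememe similarity, states and proves theorems about the similarity function, and analyzes the effect of the parameter β in the improved sememe similarity. On the basis of the improved sememe similarity, it gives a sememe-based word similarity for matching new nouns with old nouns semantically.
To address the ambiguous segmentation of Chinese words, a context-based theory of Chinese word segmentation is proposed. The theory covers a Markov chain representation of the segmentation process and a convergence analysis of that process. It points out that the lexicon is the basis for handling unambiguous segmentation, while the contextual information of words is the basis for handling ambiguous segmentation. On this theoretical basis, a context-based Chinese word segmentation method is given.
To meet the timeliness requirements of subsequent processing such as information forecasting and early warning, an Agent-oriented distributed active database framework is proposed within the Web information extraction method, so that the database can process information actively and promptly. The framework analyzes the limitations of the object-oriented method, combines Agent technology, distributed databases, and active databases, and gives an extended event-rule graph method and an improved Coffman-Graham rule-parallel algorithm. A termination analysis is given for the extended event-rule graph method to solve the termination problem of distributed active databases. For the improved Coffman-Graham rule-parallel algorithm, related theorems are stated and proved, and its parallel performance is analyzed on the basis of these theorems.
Based on the theoretical results above, a prototype of Web-MIND, an online drug information supervision system for the State Drug Administration (国药局), was developed. Its functions include searching for and extracting online pharmaceutical advertising information and reviewing Internet sites that carry drug information and advertisements.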
English abstract The rapid development of the Internet has made Web information abundant. This information holds a large amount of valuable knowledge, but it also poses problems: excessive information is difficult to process, Chinese words can be segmented ambiguously, information appears in varying formats, and information is hard to recognize. An efficient and accurate Web information extraction method is therefore essential for finding valuable Web information and discovering knowledge in it.
At present, information extraction methods focus on the problem of varying formats. These methods are based on experiments, however, and lack further theoretical analysis. To address the problems above, this research builds a model of Web information extraction that combines Web page classification, Chinese word segmentation, and fuzzy matching, which is of significance to research on information extraction.
Web information extraction is a complex system, so a Web information extraction method based on fuzzy theory is proposed. The method includes fuzzy classification of Web pages, adaptive information extraction based on feature learning for Web pages, a fuzzy matching method, context-based Chinese word segmentation, and a distributed active database.
To address the difficulty of processing excessive information, a Web page classification method based on feature selection and fuzzy learning is proposed. Within it, a feature selection method based on fuzzy theory handles the problem of high dimensionality in order to increase classification speed. It presents a method for computing the weight wf, proves that the weighted similarity is consistent with the similarity based on document relations, and analyzes the speed-up. Fuzzy learning provides a mechanism that uses the membership function to incorporate human knowledge, together with a learning rule for the parameters of the membership function. Through theoretical derivation, a Lyapunov function is used to analyze the convergence of the parameter learning, which reveals the underlying factor that drives the learning algorithm to adjust parameters toward the minimum of the error function. Based on the convergence analysis of single-parameter learning, a single-parameter learning algorithm with a variable adjustment rule is proposed to improve the learning speed.
To address the problem of varying formats, a feature extraction theory for Web pages is proposed. The theory includes the mathematical representation of Web page information, a theorem on Web page information filtering, and a similarity analysis of Web pages. It indicates that features can be extracted from the Web pages of the same Web site by using correlation reception. On the basis of this theory, the thesis combines the markup-based and content-based extraction methods, taking the advantages of both, and proposes an adaptive information extraction method based on feature learning for Web pages. Following fuzzy theory, the method uses feature learning to improve the adaptability of information extraction.
To address the problem of information recognition, a novel semantic matching method is proposed. It presents an improved sememe similarity, related theorems and their proofs, and an analysis of the effect of the coefficient β. Based on the sememe similarity, a word similarity is put forward to match new words with old words by their semantic features.
To address ambiguous Chinese word segmentation, a context-based theory of Chinese word segmentation is proposed. The theory includes a Markov chain description of the segmentation process and a convergence analysis of that process. It indicates that contextual information is used to resolve ambiguous segmentation, while the lexicon handles unambiguous segmentation. Based on this theory, a Chinese word segmentation algorithm is presented.
To meet the timeliness requirements of subsequent processing, such as information forecasting and information alarms, a distributed active database framework based on the agent-oriented method is proposed within the Web information extraction method. The framework lets the database process information actively and promptly. After analyzing the limitations of the object-oriented method, it combines the agent method, distributed databases, and active databases. The framework also presents an extended event-rule graph method and an improved Coffman-Graham parallel algorithm. For the former, a termination analysis is given to solve the termination problem of the distributed active database. For the latter, related theorems and their proofs are given, and its performance is then analyzed.
On the basis of the theoretical research above, a preliminary prototype named Web-MIND (Web-Monitoring Information of Drug) is introduced. Its functions are searching medical information on the Web effectively, filtering out irrelevant content, and monitoring Web sites that carry medicine information and medical advertisements.
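
The abstract describes learning membership function parameters by adjusting them toward the minimum of an error function, but the record gives neither the membership function nor the update rule. The following is only a minimal illustrative sketch under stated assumptions: a Gaussian membership function, a squared-error objective, and plain gradient descent on a single parameter (the center). The names `gaussian_membership`, `squared_error`, and `learn_center` are hypothetical and are not taken from the thesis.

```python
import math

def gaussian_membership(x, center, width):
    """Degree in (0, 1] to which x belongs to a fuzzy set (assumed Gaussian shape)."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def squared_error(samples, center, width):
    """Half the summed squared difference between predicted and target degrees."""
    return 0.5 * sum(
        (gaussian_membership(x, center, width) - target) ** 2
        for x, target in samples
    )

def learn_center(samples, center, width, rate=0.1, epochs=200):
    """Single-parameter learning: move `center` against the error gradient.

    This is ordinary gradient descent, used here as a stand-in for the
    thesis's own parameter learning rule, which this record does not state.
    """
    for _ in range(epochs):
        grad = 0.0
        for x, target in samples:
            mu = gaussian_membership(x, center, width)
            # d(mu)/d(center) = mu * (x - center) / width**2
            grad += (mu - target) * mu * (x - center) / width ** 2
        center -= rate * grad
    return center

# Hypothetical training data: target degrees generated from a set centred at 2.0.
samples = [(x, gaussian_membership(x, 2.0, 1.0)) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
initial_center = 1.0
learned_center = learn_center(samples, initial_center, width=1.0)
```

With these assumed samples, the error at `learned_center` is lower than at `initial_center`, which is the behavior a convergence analysis such as the Lyapunov argument mentioned in the abstract is meant to guarantee for the actual rule.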