Research on Web Information Extraction Based on Hidden Markov Models - Intelligent and Distributed Computing Laboratory

Research on Web Information Extraction Based on Hidden Markov Models

Name	Dong Zefeng (董泽锋)
Thesis defense date	2005.05.11
Thesis submission date	2005.05.20
Thesis level	Master's
Chinese title	基于隐马尔可夫模型的Web信息提取的研究
English title	Web Information Extraction Based on Hidden Markov Models
Advisor 1	Lu Zhengding (卢正鼎)
Advisor 2
Chinese keywords	Web信息提取;隐马尔可夫模型;文法推断
English keywords	Web Information Extraction;Hidden Markov Models;Grammatical Inference
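Chinese abstract (English translation)	As a global information space, the World Wide Web (WWW) holds information and knowledge of great potential value, yet the content useful to any given user may be only a tiny fraction of it and is hard to obtain. Extracting the information a user needs quickly and accurately from this mass of data has therefore become an important problem, and information extraction is an effective way to address it. The thesis first introduces the background of Web information extraction, together with the related techniques and the current state of research. For the two kinds of data met in practice, sparse and dense, two corresponding extraction algorithms are used: an information extraction algorithm based on hidden Markov models (HMMs), and one based on HMMs combined with grammatical inference. Sparse extraction tasks are handled well by the HMM-based algorithm. The states of the HMM correspond to the data fields to be extracted; the Baum-Welch algorithm learns the model's optimal probability distributions, and the Viterbi algorithm extracts the required information. A mixture of HMMs is used: in the training phase three different model structures are built by hand and then optimized on the training data set. In practice, users can assign a weight to each model according to their own requirements, which makes the overall model more widely applicable. For dense data, the algorithm that combines grammatical inference with HMMs is used. Grammatical inference learns the document structure, and state merging yields a probabilistic finite-state automaton, so that the topology encodes a grammar, which helps the subsequent extraction. The Viterbi algorithm is then used to extract the required information.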
English abstract	As a global information space, the World Wide Web (WWW) contains a wealth of valuable information and knowledge. However, the part that is useful to a given user is only a small proportion of the whole and is hard to obtain. How to extract the required information from this mass of data has therefore become an important research topic. We first review the background of Web information extraction and introduce the related techniques and the current state of research. Two algorithms are used, one for sparse and one for dense data: information extraction based on Hidden Markov Models (HMMs), and information extraction based on HMMs combined with grammatical inference. Sparse extraction tasks are handled well by the HMM-based method. The states of the HMM correspond to the fields to be extracted; the Baum-Welch algorithm learns the optimal probability distributions, and the classic Viterbi algorithm extracts the required information. We use a mixture of HMMs, that is, three different HMMs designed by hand. In practical applications, giving each individual model its own weight lets the mixture cover a wider range of extraction tasks. For dense extraction tasks, the method that combines grammatical inference with HMMs is used. It applies grammatical inference to learn the document structure and then, through state merging, obtains a probabilistic deterministic finite automaton. The resulting topology encodes a grammar, which helps the subsequent extraction step. The Viterbi algorithm is used in this method to extract the required information.
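
The extraction step described in the abstract, where HMM states stand for the fields to be extracted and the Viterbi algorithm picks the most likely state for each token, can be illustrated with a minimal sketch. It is not the thesis's implementation: the two states (`background`, `target`), the token classes (`tag`, `word`, `number`) and every probability below are hypothetical values chosen for illustration, and the probabilities are written by hand rather than learned with Baum-Welch.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most likely state sequence for a list of observed symbols."""
    if not observations:
        return []

    def log(p):
        # Work in log space; floor zero probabilities to avoid log(0).
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: "target" marks tokens that belong to the field to extract.
STATES = ("background", "target")
START_P = {"background": 0.9, "target": 0.1}
TRANS_P = {
    "background": {"background": 0.7, "target": 0.3},
    "target": {"background": 0.6, "target": 0.4},
}
EMIT_P = {
    "background": {"word": 0.8, "number": 0.1, "tag": 0.1},
    "target": {"word": 0.1, "number": 0.9},
}

TOKENS = ["tag", "word", "number", "word"]  # token classes of a short page fragment
LABELS = viterbi(TOKENS, STATES, START_P, TRANS_P, EMIT_P)
print(list(zip(TOKENS, LABELS)))
```

With these assumed numbers, only the `number` token is labelled `target`; in a real system the tokens labelled with a target state would be returned as the extracted field.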