Research on Web Information Extraction Technology Based on Machine Learning - Intelligent and Distributed Computing Laboratory

Research on Web Information Extraction Technology Based on Machine Learning

Name	金莉 (Jin Li)
Thesis defense date	2003.05.09
Thesis submission date	2005.08.06
Degree level	Master's
Chinese title	基于机器学习的Web信息提取技术的研究
English title	Web Information Extraction Technology Based on Machine Learning
Advisor 1	卢正鼎 (Lu Zhengding)
Advisor 2
Chinese keywords	机器学习;Web信息提取;FOIL算法;多策略学习;填充标记
English keywords	machine learning;Web information extraction;FOIL algorithm;multi-strategy learning;filling-tag
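Chinese abstract (translated)	As the World Wide Web gradually becomes the largest repository of information and knowledge in the world, how to extract useful information from it quickly and efficiently has become a research focus in information processing. Traditional research on information extraction concentrates on classifying and processing well-formed, structured text by means of semantic analysis. Web information, however, is not well-formed structured text: it is semi-structured text lying between structured and unstructured text, its structure cannot be determined in advance, and traditional semantic analysis no longer applies, so extraction methods suited to Web information are needed. Machine learning opens a new research direction for Web information extraction; its adaptive mechanism copes well with the dynamic and loosely organized nature of Web information, allowing a system to revise old rules and derive new ones automatically from feedback during extraction. Some research on machine-learning-based Web information extraction exists in China and abroad, but these algorithms show various shortcomings in practical use, so improving existing algorithms and proposing new ones is especially important. Through analysis and comparison, this thesis presents two new machine-learning-based Web information extraction algorithms and an effective improvement to the existing FOIL algorithm, and analyzes and evaluates the performance of each algorithm experimentally. To address the uncertainty FOIL shows when learning complex relations between non-adjacent Web pages, a new path-learning algorithm based on relations between pages is proposed. The multi-strategy learning algorithm combines several learning algorithms, which overcomes the one-sidedness of rules derived by a single machine learning algorithm, so that the resulting rules reflect the distribution of Web information more completely. The learning algorithm based on template filling tags uses a cascaded-module method that derives rules bottom-up; by filling the extraction template with a number of SGML tags that help identify the category of information, the algorithm can cover invisible information in Web pages and effectively control the omission and overflow of information during learning, achieving intelligent Web information extraction. In addition, the algorithms were applied in the development of a system for supervising drug information and e-commerce on the Internet ("Internet上药品信息及电子商务监管系统") for China's national drug administration (国家药品监督管理总局), and the experimental results show that the three algorithms improve considerably on existing algorithms in recall and extraction precision.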
English abstract	The World Wide Web is becoming the largest information base in the world, and extracting useful information from it quickly and effectively has become a focus of information-processing research. Traditional information extraction technology, which relies on semantic analysis, classifies and processes only well-formed, structured document collections. Web information is not formal structured text; it is semi-structured text that falls between structured text and free text. Its structure is uncertain and traditional semantic analysis is unsuitable, so new extraction methods designed for the Web are needed. Machine learning opens a new research direction for Web information extraction: its adaptive ability suits dynamic, loosely organized Web information, and it can automatically amend old rules and induce new ones from feedback during extraction. Although there has been some research on machine-learning-based Web information extraction, existing methods show various limitations in practice, so improving existing algorithms and proposing new ones is important. This thesis presents two new machine learning algorithms for Web information extraction, improves FOIL, and analyzes and evaluates the performance of each algorithm experimentally. 
The improvement to FOIL addresses the uncertainty FOIL shows when learning complex relations among non-adjacent Web pages, and introduces a new path-learning method based on relations between Web pages. The multi-strategy learning algorithm combines several learning algorithms, overcoming the one-sidedness of applying a single learning algorithm to flexible Web information, so that the resulting rules reflect Web information more completely. The learning algorithm based on filling tags derives rules bottom-up through cascaded modules and fills the extraction template with a number of SGML tags that help identify the type of information; this lets it cover unseen information on the Web and control the loss or overflow of information during learning, realizing intelligent Web information extraction. Moreover, the algorithms were applied in a project for the SFDA (State Food and Drug Administration) named “Web-Monitoring INformation of Drug”, and the experimental results indicate that all three algorithms improve on existing ones in recall and precision.
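
The abstract reports its results in terms of recall and precision. As a minimal sketch of how these two standard measures are usually computed for an extraction task (this is not the thesis's evaluation code; the function name `precision_recall` and the sample item sets are hypothetical and chosen only for illustration):

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for extracted items against a reference set.

    precision = correct extractions / all extractions
    recall    = correct extractions / all relevant items
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: items that should be extracted vs. items a system returned.
gold = {"aspirin", "ibuprofen", "amoxicillin", "metformin"}
found = {"aspirin", "ibuprofen", "paracetamol"}

p, r = precision_recall(found, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three returned items are correct and two of the four relevant items are found, so a system that misses information lowers recall, while one that returns spurious items lowers precision.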