Intelligent and Distributed Computing Laboratory
  Research on the Technique of Semantic Desktop Search
Name Li Sheng (李胜)
Thesis defense date 2008.06.03
Thesis submission date 2008.06.05
Thesis level Doctoral
Chinese title 语义桌面搜索技术研究
英文题名 Research on the Technique of Semantic Desktop Search
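Advisor 1 Lu Zhengding (卢正鼎)
Advisor 2 Hu Heping (胡和平)
Chinese keywords 语义桌面 (semantic desktop); 语义网 (Semantic Web); 本体 (ontology); 语义搜索 (semantic search); 元数据 (metadata); 无结构文档 (unstructured document); 信息检索模型 (information retrieval model); 结果排序 (result ranking)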
英文关键词 Semantic Desktop;Semantic Web;Ontology;Semantic Search;Metadata;Unstructured Document;Information Retrieval Model;Result Ranking
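Chinese abstract (English translation) The steady spread of computer technology and the rapid development of hard disk technology have caused the number of documents on personal computers to grow at an astonishing rate. How to manage and use these documents effectively is an important problem that needs to be solved. Desktop search tools based on full-text indexing, which have appeared in recent years, address part of the problem, but they cannot provide semantic-level search that finds desktop resources with a latent relationship to the query. The semantic desktop concept creates an opportunity to solve the problem of managing desktop resources: it attempts to port a range of Semantic Web technologies to the personal computer in order to strengthen desktop management. At present, research on semantic desktop technology in China and abroad is still at a preliminary, case-by-case stage, and no general solution has emerged. Based on a comprehensive reading of the related literature and an analysis of the state of research, several key technologies of semantic desktop search are studied by category, including metadata extraction, information extraction from unstructured documents, the desktop retrieval model, and the ranking of retrieval results. Current metadata schemes for the semantic desktop cover only metadata related to the static attributes of desktop files and cannot reflect the relationship between user behavior and documents. A dynamic metadata extraction scheme based on user behavior is therefore proposed. It takes full account of the desktop contexts related to user behavior, such as email, file directories, and browser caches. Contexts are detected and partitioned by analyzing the user's implicit feedback. A metadata generator is created, and the desktop metadata is stored on the personal computer in the form of an ontology. Existing semantic desktop systems handle unstructured documents poorly, fundamentally because it is difficult to extract useful information from such documents. Building on traditional information extraction techniques, an ontology-based scheme for extracting information from unstructured documents is given. The scheme first builds an ontology to describe the documents and then analyzes several kinds of latent relationships among the entities in the ontology, such as text-proximity relationships, text co-occurrence relationships, and high-frequency entities. Analyzing these relationships yields a matching coefficient for each candidate entity, which determines the recognized entities; the result is output as XML. Experiments show that the method achieves a relatively high recognition rate and accuracy. Searching desktop documents requires the support of an information retrieval model, and such models have long been an important research topic in information retrieval. Building on a study of the traditional vector space model, an ontology-based semantic information retrieval model is designed. Its focus includes the design of semantic term weights, the analysis of semantic relationships among keywords, and the strategy for computing the similarity between semantic feature vectors. In the model, the relationships between different semantic terms are re-examined by means of a concept connectivity graph, and the computation of semantic similarity is divided into concept similarity and property similarity. Taking both into account improves retrieval effectiveness. Ranking retrieval results is an important step in document retrieval. After a study of existing Web ranking algorithms and pattern graph theory, a result ranking method based on authority transfer is proposed. The method uses an ontology to describe the patterns of authority transfer between documents and, by setting different transfer coefficients, reflects how different linking clues between documents affect how closely the documents are related. Experimental results show that the method returns highly important results first and effectively reflects the associations between document objects.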
English abstract The spread of computer technology and the rapid development of magnetic data storage technology are causing the number of documents on personal computers to grow at an alarming rate. How to manage and use these documents effectively is an important problem. In recent years, desktop search tools based on full-text indexing have solved part of the problem. However, these tools can neither provide semantic search services nor identify desktop resources that are potentially relevant to the user's query. The emergence of the semantic desktop offers an opportunity to solve the problem of desktop resource management. It attempts to bring a variety of Semantic Web technologies to the personal computer, thereby enhancing desktop management. Research on the semantic desktop is still at a preliminary stage: only a few case studies have been reported, and there is no general method. Based on a review of the related literature and an analysis of the state of research, we study several key technologies of semantic desktop search: metadata extraction, entity recognition in unstructured documents, the desktop retrieval model, and search result ranking. Current metadata solutions for the semantic desktop cover only the static properties of desktop files and cannot reflect the relationship between user behavior and documents. This thesis presents a dynamic metadata extraction method based on user behavior that takes into account the desktop contexts related to that behavior, such as email, file folders, and the Web cache. We then create a metadata generator and store the metadata on the personal computer in the form of an ontology. Today's semantic desktop systems handle unstructured documents poorly. Building on traditional information extraction techniques, we propose an ontology-based method for extracting information from unstructured documents. The method creates an ontology to describe the documents in a specific domain and then analyzes various latent relationships between entities, such as text-proximity relationships, text co-occurrence relationships, and popular (high-frequency) entities. From these relationships we determine the matching coefficient of each candidate entity, identify the recognized entities, and output the result as XML. Experiments show that the method achieves relatively high recall and accuracy. Searching for documents on the desktop requires an information retrieval model, and such models have long been an important research topic in information retrieval. We propose and implement an ontology-based semantic retrieval model. To overcome the shortcomings of the traditional vector space model in handling semantics, we study the weighting of semantic items, the semantic relations between keywords, and the similarity between semantic vectors. In the model, the relations between different semantic items are computed, and semantic similarity is calculated from two parts, concept similarity and property similarity. This improves the performance of the IR system. Result ranking is an important component of document retrieval. Drawing on a study of Web search ranking algorithms and pattern graph theory, we propose a search result ranking method based on authority transfer. The method uses an ontology to describe authority transfer patterns between entities in documents and sets different transfer coefficients to reflect the impact of different linking clues between documents. Experimental results show that the method returns important results with high priority and effectively reflects the associations between document objects.
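Illustrative note The abstract states that authority is passed between documents along typed links, each with its own transfer coefficient, but gives no formula. The sketch below is one possible reading, assuming a PageRank-style iteration as a stand-in for the thesis's actual algorithm. The link types, the coefficient values, the damping factor, and the name `authority_rank` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical link types and transfer coefficients (illustrative values,
# not taken from the thesis).
TRANSFER = {"attachment_of": 0.7, "same_folder": 0.2}


def authority_rank(nodes, edges, transfer, damping=0.85, iterations=50):
    """Rank nodes by authority passed along typed edges.

    edges is a list of (source, target, link_type) tuples. Each source
    splits its score evenly among the targets of one link type, scaled
    by that type's transfer coefficient. Scores are not renormalised,
    so they need not sum to 1.
    """
    # Number of outgoing edges per (source, link_type) pair.
    out_count = defaultdict(int)
    for src, _dst, kind in edges:
        out_count[(src, kind)] += 1

    score = {n: 1.0 / len(nodes) for n in nodes}
    base = (1.0 - damping) / len(nodes)
    for _ in range(iterations):
        nxt = {n: base for n in nodes}
        for src, dst, kind in edges:
            share = score[src] / out_count[(src, kind)]
            nxt[dst] += damping * transfer[kind] * share
        score = nxt

    ranking = sorted(nodes, key=lambda n: score[n], reverse=True)
    return ranking, score


# Toy desktop graph: an email carries a report as an attachment, and the
# report shares a folder with a notes file.
nodes = ["report.doc", "notes.txt", "mail_1"]
edges = [
    ("mail_1", "report.doc", "attachment_of"),
    ("notes.txt", "report.doc", "same_folder"),
    ("report.doc", "notes.txt", "same_folder"),
]
ranking, score = authority_rank(nodes, edges, TRANSFER)
print(ranking)
```

In this toy graph the report receives authority over two link types and ends up ranked first. Raising or lowering a coefficient changes how strongly that kind of link counts as evidence that two documents are related.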