Intelligent and Distributed Computing Laboratory
  Application and Research on Semantic Matching in the Information Administration System
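Name 程莉 (Cheng Li)
Thesis defense date 2003.05.09
Thesis submission date 2005.08.06
Thesis level Master's
Chinese title 语义匹配在信息监管系统中的应用研究
English title Application and Research on Semantic Matching in the Information Administration System
Advisor 1 卢正鼎 (Lu Zhengding)
Advisor 2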
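Chinese keywords 语义计算 (semantic computing);相似度 (similarity);相关度 (relevancy);文本分类 (text classification);向量空间模型 (vector space model);隐含语义索引 (latent semantic indexing);奇异值分解 (singular value decomposition)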
English keywords semantic computing;similarity;relevancy;text classification;Vector Space Model;Latent Semantic Indexing;Singular Value Decomposition
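Chinese abstract An effective way to improve the performance of traditional information matching and classification systems is to classify text by its semantics, that is, by its conceptual topic. For semantic-computation-based matching and classification built on HowNet (《知网》), similarity between words and between sentences is computed with the strategy that "the similarity of the whole equals the weighted average of the similarities of its parts". A whole is first decomposed into parts; the parts of the two wholes are then paired, and the weighted average of the pairwise similarities gives the similarity of the wholes. Applying this strategy repeatedly to concept semantic expressions reduces the overall similarity of two semantic expressions to a combination of similarities between pairs of sememes. The similarity between two sememes is obtained by converting their semantic distance. The proposed kNN text classification method based on Latent Semantic Indexing (LSI) is an application of LSI to Chinese text classification. It retains the representational strengths of the vector space model while compensating for that model's disregard of text semantics. The basic idea is to exploit the latent semantic structure among the words of a text: keywords that reflect the text are extracted first, and texts are then matched and classified by analyzing the associations and latent semantic relations among those keywords. This research was carried out at the request of the State Drug Administration (国家药品监督管理局) to automatically search drug-related sites on the Internet and raise timely alerts when non-compliant information is found. The prototype system achieved good results in trial use.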
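The "whole equals the weighted average of its parts" strategy can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the toy sememe tree `PARENT`, the constant `ALPHA`, the conversion `ALPHA / (distance + ALPHA)` (one commonly used way to turn a semantic distance into a similarity), and the position-wise pairing of parts.

```python
# Toy sememe hierarchy (hypothetical, child -> parent); not taken from HowNet.
PARENT = {
    "entity": None,
    "thing": "entity",
    "medicine": "thing",
    "food": "thing",
    "event": "entity",
    "sell": "event",
    "cure": "event",
}

ALPHA = 1.6  # assumed tuning constant for the distance-to-similarity conversion


def path_to_root(sememe):
    """Return the list of nodes from a sememe up to the root of the toy tree."""
    path = []
    while sememe is not None:
        path.append(sememe)
        sememe = PARENT[sememe]
    return path


def semantic_distance(a, b):
    """Count the edges on the tree path between two sememes."""
    path_a = path_to_root(a)
    path_b = path_to_root(b)
    for steps_a, node in enumerate(path_a):
        if node in path_b:
            # First shared ancestor: add the steps taken on both sides.
            return steps_a + path_b.index(node)
    raise ValueError("sememes share no common ancestor")


def sememe_similarity(a, b):
    """Convert semantic distance to a similarity in (0, 1] (assumed formula)."""
    return ALPHA / (semantic_distance(a, b) + ALPHA)


def whole_similarity(parts_a, parts_b, weights):
    """Weighted average of the similarities of position-wise paired parts."""
    if not (len(parts_a) == len(parts_b) == len(weights)):
        raise ValueError("parts and weights must have the same length")
    weighted = sum(
        w * sememe_similarity(x, y)
        for w, x, y in zip(weights, parts_a, parts_b)
    )
    return weighted / sum(weights)
```

For example, `whole_similarity(["medicine", "sell"], ["medicine", "cure"], [0.7, 0.3])` compares two hypothetical two-part expressions, giving the first part more weight. A fuller version would apply the same decomposition recursively to nested expressions and choose the pairing of parts rather than assume it.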
English abstract An effective way to improve the performance of traditional information classification systems is to classify information by its semantics or conceptual topics. The HowNet-based method for computing semantic similarity between words or sentences adopts the strategy that the similarity of a whole equals the weighted average of the similarities of its parts. Each whole is decomposed into parts, and the similarity of two wholes is obtained as the weighted average of the similarities between their paired parts. In this way, the overall similarity between two semantic expressions can be decomposed into similarities between pairs of primitives (sememes). The similarity between primitives is computed by converting their semantic distance. The k-Nearest Neighbor (kNN) text classification method based on Latent Semantic Indexing (LSI) is an application of LSI to Chinese text classification. It keeps the representational advantages of the Vector Space Model (VSM) and makes up for the VSM's neglect of semantics. The main idea is to extract keywords from the text and analyze the latent semantic relations among them, so that documents can be matched and classified effectively. This research, undertaken at the request of the State Drug Administration, can be applied to automatically search drug-related websites on the Internet and raise a warning when illegal information is found. The demo system achieved good results in trial use.
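The LSI-based kNN idea described in the abstract can be sketched roughly as follows. This is an illustrative outline rather than the thesis's implementation: the term list, the tiny term-document matrix `TRAIN`, the labels, the rank, and the helper names `fit_lsi` and `knn_classify` are all hypothetical. It uses NumPy's `np.linalg.svd` for the singular value decomposition and cosine similarity in the reduced space for the neighbor search.

```python
from collections import Counter

import numpy as np

# Hypothetical keyword counts: rows are terms, columns are training documents.
TERMS = ["drug", "dose", "pharmacy", "match", "goal", "team"]
TRAIN = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)
LABELS = ["medical", "medical", "sports", "sports"]


def fit_lsi(term_doc, rank):
    """Truncated SVD: keep the top `rank` left singular vectors."""
    u, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    u_k = u[:, :rank]
    # Project every training document (a column) into the latent space.
    doc_vectors = term_doc.T @ u_k
    return u_k, doc_vectors


def knn_classify(query_terms, u_k, doc_vectors, labels, k=3):
    """Vote among the k training documents most similar to the query."""
    q = query_terms @ u_k  # fold the query into the same latent space
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    sims = (doc_vectors @ q) / (norms + 1e-12)  # cosine similarity
    nearest = np.argsort(-sims)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


u_k, doc_vectors = fit_lsi(TRAIN, rank=2)
query = np.array([1, 0, 1, 0, 0, 0], dtype=float)  # mentions "drug", "pharmacy"
predicted = knn_classify(query, u_k, doc_vectors, LABELS, k=3)
```

Projecting both the training documents and the query with the same truncated basis is what lets documents that share no literal keyword still end up close together, which is the gap in the plain vector space model that the abstract refers to.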