Abstract
The low coverage of Chinese FrameNet leaves many lexical units in Chinese sentences unregistered (unknown), which severely restricts frame semantic analysis for Chinese. To identify frames for unknown lexical units, this paper draws on the word sense information in Tongyici CiLin and proposes two methods, one based on average semantic similarity and one based on a Maximum Entropy (ME) model, using feature selection that combines static and dynamic features. Experiments show that both methods can effectively select frames for unknown lexical units: the similarity-based method reaches 78.61% accuracy when the Top-4 candidates are considered, while the ME-based method reaches 87.29% Top-1 accuracy on the same test set and 75% on a separate set of news texts.
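The abstract does not give the similarity formula, so the following is only a minimal sketch of how a Top-k frame ranking by average semantic similarity might look. Everything in it is an assumption made for illustration: the thesaurus codes in `CILIN_CODES`, the frame inventory in `FRAME_LEXICAL_UNITS`, the prefix-overlap measure in `word_similarity`, and the function `rank_frames` are hypothetical and are not taken from the paper, from Tongyici CiLin, or from Chinese FrameNet.

```python
from statistics import mean

# Hypothetical thesaurus codes (invented for this sketch): words whose
# codes share a longer prefix are treated as closer in meaning.
CILIN_CODES = {
    "buy": "He03A01",
    "purchase": "He03A02",
    "acquire": "He03A03",
    "sell": "He03B01",
    "walk": "Fb01A01",
    "run": "Fb01A05",
}

# Hypothetical frame inventory: each frame lists its known lexical units.
FRAME_LEXICAL_UNITS = {
    "Commerce_buy": ["buy", "purchase"],
    "Commerce_sell": ["sell"],
    "Self_motion": ["walk", "run"],
}


def word_similarity(word_a, word_b):
    """Toy similarity: share of leading code characters two words have in common."""
    code_a = CILIN_CODES.get(word_a)
    code_b = CILIN_CODES.get(word_b)
    if code_a is None or code_b is None:
        return 0.0  # a word missing from the thesaurus gets no similarity
    shared = 0
    for char_a, char_b in zip(code_a, code_b):
        if char_a != char_b:
            break
        shared += 1
    return shared / max(len(code_a), len(code_b))


def rank_frames(target, k=4):
    """Score each frame by the mean similarity between the target word and
    the frame's known lexical units, then return the k best (frame, score) pairs."""
    scores = {
        frame: mean(word_similarity(target, unit) for unit in units)
        for frame, units in FRAME_LEXICAL_UNITS.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # "acquire" plays the role of an unknown lexical unit here.
    for frame, score in rank_frames("acquire"):
        print(f"{frame}: {score:.3f}")
```

Under this setup, a Top-4 evaluation would count a prediction as correct when the gold frame appears anywhere in the list returned by `rank_frames(target, k=4)`, whereas a Top-1 evaluation would look only at the first entry.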
Key words
Chinese FrameNet /
unknown lexical unit /
word sense information
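Funding
National Natural Science Foundation of China (60970053, 61373082); State Language Commission "Twelfth Five-Year Plan" Research Program (YB125-19); National High Technology Research and Development Program of China (863 Program) (2006AA01Z142); Shanxi Province Research Project for Returned Overseas Scholars (20B-05)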