Abstract
This paper presents a supervised approach to Chinese word sense disambiguation based on the AdaBoost.MH learning algorithm. AdaBoost.MH boosts the weak rules produced by decision stumps (single-level decision trees): it calls the weak learner repeatedly and, after a number of iterations, combines the resulting rules into a more accurate classification rule. A simple criterion for stopping the iteration is also given. To extract knowledge from the context of a polysemous word, we introduce a new knowledge source, semantic categories, in addition to two classical ones, the part of speech of neighboring words and local collocations; the new source improves both the learning efficiency of the algorithm and the disambiguation accuracy. In experiments on 6 typical polysemous words and 20 polysemous words from the SENSEVAL-3 Chinese corpus, AdaBoost.MH achieves a relatively high open-test accuracy of 85.75%.
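The boosting loop summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary context features (hypothetical indicators such as "a given collocation is present"), one decision stump per round with smoothed confidence-rated predictions in the style of Schapire and Singer, and a fixed number of rounds in place of the paper's stopping criterion, which the abstract does not specify. The names `train_adaboost_mh`, `predict`, `rounds`, and `eps` are illustrative.

```python
import math

def train_adaboost_mh(X, y, n_labels, rounds=5, eps=1e-6):
    """X: list of 0/1 feature vectors; y: list of sense indices (assumed encoding)."""
    m, n_feats = len(X), len(X[0])
    # Y[i][l] is +1 if l is the correct sense of example i, otherwise -1.
    Y = [[1 if y[i] == l else -1 for l in range(n_labels)] for i in range(m)]
    # Start with uniform weights over all (example, label) pairs.
    D = [[1.0 / (m * n_labels)] * n_labels for _ in range(m)]
    model = []
    for _ in range(rounds):  # fixed round count: an assumption, not the paper's criterion
        best = None
        for f in range(n_feats):
            # W[b][l] = [weight of pairs with Y = -1, weight of pairs with Y = +1]
            # among examples whose feature f has value b.
            W = [[[0.0, 0.0] for _ in range(n_labels)] for _ in range(2)]
            for i in range(m):
                for l in range(n_labels):
                    W[X[i][f]][l][1 if Y[i][l] > 0 else 0] += D[i][l]
            # Pick the stump that minimizes Z = 2 * sum sqrt(W+ * W-).
            z = 2.0 * sum(math.sqrt(neg * pos) for block in W for neg, pos in block)
            if best is None or z < best[0]:
                # Smoothed confidence-rated prediction for each (feature value, label).
                c = [[0.5 * math.log((pos + eps) / (neg + eps)) for neg, pos in block]
                     for block in W]
                best = (z, f, c)
        _, f, c = best
        model.append((f, c))
        # Reweight: pairs the stump gets wrong gain weight, then renormalize.
        total = 0.0
        for i in range(m):
            for l in range(n_labels):
                D[i][l] *= math.exp(-Y[i][l] * c[X[i][f]][l])
                total += D[i][l]
        for i in range(m):
            for l in range(n_labels):
                D[i][l] /= total
    return model

def predict(model, x, n_labels):
    # Sum the stump predictions per sense and return the highest-scoring sense.
    scores = [sum(c[x[f]][l] for f, c in model) for l in range(n_labels)]
    return max(range(n_labels), key=scores.__getitem__)

# Toy data: feature 0 signals sense 0, feature 1 signals sense 1, feature 2 is noise.
X = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0], [0, 0, 1]]
y = [0, 0, 1, 1, 2, 2]
model = train_adaboost_mh(X, y, n_labels=3)
print([predict(model, x, 3) for x in X])
```

On the toy data a single stump cannot separate all three senses, so the second round concentrates weight on the examples the first stump left undecided. This is the sense in which boosting strengthens weak rules.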
Key words
artificial intelligence /
natural language processing /
word sense disambiguation /
AdaBoost.MH algorithm /
multiple knowledge sources
Funding
Supported by the National Natural Science Foundation of China (60373095; 60373096)