1. School of Computer & Information Technology, Shanxi University, Taiyuan, Shanxi 030006, China; 2. Key Laboratory of Ministry of Education for Computational Intelligence & Chinese Information Processing, Taiyuan, Shanxi 030006, China
Abstract: Frame element labeling still relies mainly on supervised machine learning methods, which require a large-scale manually annotated corpus for training. To reduce the cost of manual annotation, this paper presents an active learning approach that selects only the most uncertain samples for annotation rather than annotating the whole training corpus. Experimental results show that, with the same number of training samples, active learning raises the F-score of frame element labeling by about 4.83 percentage points. Equivalently, to reach roughly the same labeling performance, only about 70% of the samples need to be annotated compared with the usual random selection method.
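The abstract describes pool-based active learning with uncertainty sampling: in each round the current model scores the unlabeled pool and the samples it is least confident about are sent for annotation. The following is a minimal Python sketch of that general strategy; the classifier, data, and function names are illustrative assumptions for exposition, not the authors' actual frame-element labeling system.

```python
# Sketch of pool-based uncertainty sampling (least-confidence selection).
# Assumptions: a generic scikit-learn classifier and synthetic data stand in
# for the paper's actual sequence labeler and annotated frame corpus.
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_uncertain(model, unlabeled_X, batch_size=10):
    """Return indices of the batch_size samples the model is least confident about."""
    proba = model.predict_proba(unlabeled_X)      # class probabilities per sample
    confidence = proba.max(axis=1)                # probability of the predicted class
    return np.argsort(confidence)[:batch_size]    # lowest confidence first

# Toy active-learning loop on synthetic data (placeholder for annotated frames).
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 20))
y_pool = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)

labeled = list(range(20))                         # small seed set of "annotated" samples
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

for _ in range(5):                                # five simulated annotation rounds
    model = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    picked = select_uncertain(model, X_pool[unlabeled], batch_size=10)
    newly_labeled = [unlabeled[i] for i in picked]  # these would go to a human annotator
    labeled += newly_labeled
    unlabeled = [i for i in unlabeled if i not in newly_labeled]
```

The design choice reflected in the abstract is that annotation effort is spent only where the model is uncertain, which is why comparable performance can be reached with roughly 70% of the samples required by random selection.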