Abstract
The cost of annotating training corpora is a major obstacle in natural language processing for low-resource languages. Active learning can reduce this cost substantially by selecting informative, non-redundant data for manual annotation. This paper proposes four active learning methods based on the labeling confidence output by a CRF model and determines their parameters experimentally. The experiments show that, when data with confidence below 0.7 are selected for manual annotation and iteration continues until the labeling results of the new and old models differ by less than 0.01%, only 6 iterations are needed: with 3.2 MB of manually annotated data, Tibetan person name recognition reaches an F-measure of 88%. A supervised CRF model needs about 10 MB of annotated data to reach the same performance, so the active learning method reduces the annotation scale by about 66%.
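The selection and stopping rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `train`, `predict`, and `annotate` callables, and the use of a fixed reference set `held_out` for comparing the old and new models are all assumptions made for illustration. Only the confidence threshold of 0.7 and the 0.01% difference criterion come from the abstract.

```python
from typing import Callable, List, Sequence, Tuple


def select_low_confidence(confidences: Sequence[float],
                          threshold: float = 0.7) -> List[int]:
    """Return indices of sentences whose confidence is below the threshold."""
    return [i for i, c in enumerate(confidences) if c < threshold]


def disagreement_rate(old_tags: Sequence[Sequence[str]],
                      new_tags: Sequence[Sequence[str]]) -> float:
    """Fraction of tokens tagged differently by two models (assumed metric)."""
    total = sum(len(sentence) for sentence in old_tags)
    if total == 0:
        return 0.0
    differing = sum(a != b
                    for old, new in zip(old_tags, new_tags)
                    for a, b in zip(old, new))
    return differing / total


def active_learning(labeled: list,
                    pool: list,
                    train: Callable,      # hypothetical: labeled data -> model
                    predict: Callable,    # hypothetical: (model, sentence) -> (tags, confidence)
                    annotate: Callable,   # hypothetical: sentence -> labeled example (human step)
                    held_out: list,       # assumed reference set for comparing models
                    threshold: float = 0.7,
                    tol: float = 1e-4,    # 0.01% expressed as a fraction
                    max_rounds: int = 20) -> Tuple[object, list, int]:
    """Iterate: select low-confidence data, annotate, retrain, check the change."""
    model = train(labeled)
    rounds = 0
    while pool and rounds < max_rounds:
        confidences = [predict(model, s)[1] for s in pool]
        picked = set(select_low_confidence(confidences, threshold))
        if not picked:
            break  # the model is confident about everything left in the pool
        labeled = labeled + [annotate(pool[i]) for i in sorted(picked)]
        pool = [s for i, s in enumerate(pool) if i not in picked]
        new_model = train(labeled)
        old = [predict(model, s)[0] for s in held_out]
        new = [predict(new_model, s)[0] for s in held_out]
        model = new_model
        rounds += 1
        if disagreement_rate(old, new) < tol:
            break  # new and old models label the reference set almost identically
    return model, labeled, rounds
```

In practice `predict` would wrap a CRF tagger that reports a sequence-level confidence, and `annotate` would stand for the human labeling step. How the paper's four methods compute or combine confidence is not stated in the abstract, so the sketch leaves that to the caller.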
Keywords
Tibetan person name recognition /
active learning /
confidence
Funding
National Natural Science Foundation of China (61331013, 61501529)