Confidence Based Active Learning Model for Tibetan Person Name Recognition

WANG Zhijuan, LIU Feifei, ZHAO Xiaobing, SONG Wei

Journal of Chinese Information Processing, 2019, Vol. 33, Issue 8: 53-59.
Section: Information Processing for Ethnic, Cross-Border, and Neighboring Languages


Abstract

The cost of annotating training corpora is a major obstacle for research on low-resource languages. Active learning can select informative, non-redundant samples for manual annotation and thereby greatly reduce that cost. Based on the labeling confidence produced by a CRF model, this paper proposes four active learning methods and determines their parameters experimentally. The experiments show that when samples with confidence below 0.7 are selected for manual annotation, and iteration continues until the labeling results of the new and old models differ by less than 0.01%, only 6 iterations are needed. With 3.2 MB of manually annotated corpus, the F-measure for Tibetan person name recognition reaches 88%. A CRF-based supervised model needs about 10 MB of annotated corpus to reach the same F-measure, so the active learning method reduces the annotation scale by about 66%.
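The selection and stopping rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the CRF model is replaced by caller-supplied `train` and `predict` callables, `oracle` stands in for the human annotator, and all function and parameter names are hypothetical. The abstract does not say which data the new and old models are compared on, so the sketch assumes a fixed held-out set (`eval_set`).

```python
from typing import Any, Callable, List, Sequence, Tuple

THRESHOLD = 0.7     # confidence cutoff reported in the abstract
STOP_DIFF = 0.0001  # 0.01% disagreement between successive models


def select_low_confidence(confidences: Sequence[float],
                          threshold: float = THRESHOLD) -> List[int]:
    """Return indices of samples whose confidence is strictly below the threshold."""
    return [i for i, c in enumerate(confidences) if c < threshold]


def disagreement(old_labels: Sequence[Any], new_labels: Sequence[Any]) -> float:
    """Fraction of positions at which two labelings of the same data differ."""
    if not old_labels:
        return 0.0
    return sum(a != b for a, b in zip(old_labels, new_labels)) / len(old_labels)


def active_learning(labeled: List[Tuple[Any, Any]],
                    pool: List[Any],
                    eval_set: List[Any],
                    train: Callable[[List[Tuple[Any, Any]]], Any],
                    predict: Callable[[Any, Any], Tuple[Any, float]],
                    oracle: Callable[[Any], Any],
                    threshold: float = THRESHOLD,
                    stop_diff: float = STOP_DIFF,
                    max_rounds: int = 50):
    """Iteratively annotate low-confidence samples until successive models agree.

    `predict(model, x)` must return a (label, confidence) pair.
    Assumption: agreement is measured on a fixed held-out `eval_set`.
    """
    labeled, pool = list(labeled), list(pool)
    model = train(labeled)
    old_labels = [predict(model, x)[0] for x in eval_set]
    rounds = 0
    while pool and rounds < max_rounds:
        confidences = [predict(model, x)[1] for x in pool]
        picked = set(select_low_confidence(confidences, threshold))
        if not picked:
            break  # the model is confident about everything left in the pool
        # Ask the annotator (oracle) for labels of the selected samples.
        labeled += [(pool[i], oracle(pool[i])) for i in sorted(picked)]
        pool = [x for i, x in enumerate(pool) if i not in picked]
        model = train(labeled)
        rounds += 1
        new_labels = [predict(model, x)[0] for x in eval_set]
        if disagreement(old_labels, new_labels) < stop_diff:
            break  # new and old models label the data almost identically
        old_labels = new_labels
    return model, labeled, rounds
```

In practice `predict` would wrap a CRF tagger that exposes a sequence-level or token-level probability, and `oracle` would be a manual annotation step. Which of the paper's four confidence measures is used is left to the caller.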

Key words

Tibetan person name recognition / active learning / confidence

Cite this article

WANG Zhijuan, LIU Feifei, ZHAO Xiaobing, SONG Wei. Confidence Based Active Learning Model for Tibetan Person Name Recognition. Journal of Chinese Information Processing. 2019, 33(8): 53-59

Funding

National Natural Science Foundation of China (61331013, 61501529)