Abstract
In recent years, neural network models trained on large-scale annotated corpora have substantially improved the performance of named entity recognition (NER). However, manually annotated data is expensive to obtain in a new domain, which makes fast, low-cost domain transfer important. Given only unlabeled data in the target domain, this paper automatically constructs a weakly annotated corpus for the target domain and models it. First, two different methods are used to annotate the unlabeled data automatically. Then, annotations on which the two methods agree are kept and conflicting ones are discarded, reducing erroneous annotations as much as possible and yielding a partially annotated corpus. Finally, a new entity recognition model based on partial annotation learning is proposed that can be trained on the weakly annotated data. Experiments on transfer from the news domain to the social media domain and the finance domain show that the proposed approach effectively improves the domain adaptation performance of NER at a low transfer cost. The method also performs well when combined with the pre-trained language model BERT.
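To make the two core steps concrete, the sketch below first merges two automatic annotations by keeping the labels they agree on and leaving disagreeing positions unconstrained, then scores a sentence under a linear-chain CRF by summing over all label paths consistent with those constraints (marginal likelihood, in the spirit of references [13,14,19]). This is a minimal NumPy sketch under our own assumptions (BIO-style tags, a generic CRF parameterization); all function names are illustrative, not the authors' implementation.

```python
import numpy as np

NEG_INF = -1e9  # stand-in for log(0)

def merge_annotations(tags_a, tags_b, label_set):
    """Keep a label only where the two automatic annotators agree;
    a disagreeing position is left unconstrained (any label allowed)."""
    return [{a} if a == b else set(label_set) for a, b in zip(tags_a, tags_b)]

def log_sum_exp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def forward_log_z(emissions, transitions, allowed=None):
    """CRF forward algorithm over a sentence of length T with L labels.
    emissions: (T, L) scores; transitions: (L, L) scores;
    allowed: optional (T, L) boolean mask restricting the paths summed over."""
    T, _ = emissions.shape
    score = emissions[0].copy()
    if allowed is not None:
        score = np.where(allowed[0], score, NEG_INF)
    for t in range(1, T):
        score = log_sum_exp(score[:, None] + transitions + emissions[t][None, :], axis=0)
        if allowed is not None:
            score = np.where(allowed[t], score, NEG_INF)
    return log_sum_exp(score, axis=0)

def partial_annotation_nll(emissions, transitions, merged, labels):
    """Negative marginal log-likelihood of a partially annotated sentence:
    -log(Z_constrained / Z_full), i.e. the loss only has to explain the
    per-position allowed-label sets produced by merge_annotations."""
    allowed = np.array([[lab in s for lab in labels] for s in merged])
    return forward_log_z(emissions, transitions) - forward_log_z(emissions, transitions, allowed)
```

Under this loss, a position where the two annotators agree behaves like an ordinary supervised label, while a conflicting position simply sums over all labels, so a wrong automatic annotation never forces the model toward an incorrect tag.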
Key words
named entity recognition /
domain adaptation /
partial annotation
References
[1] Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016: 260-270.
[2] Dong C, Zhang J, Zong C, et al. Character-based LSTM-CRF with radical-level features for Chinese named entity recognition[M]//Natural Language Understanding and Intelligent Applications. Cham: Springer, 2016: 239-250.
[3] Yin Zhangzhi, Li Xinzi, Huang Degen, et al. Chinese named entity recognition fusing character and word models[J]. Journal of Chinese Information Processing, 2019, 33(11): 95-100.
[4] Mou L, Meng Z, Yan R, et al. How transferable are neural networks in NLP applications?[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016: 479-489.
[5] Gao Bingtao, Zhang Yang, Liu Bin. BioTrHMM: A biomedical named entity recognition algorithm based on transfer learning[J]. Application Research of Computers, 2019, 36(01): 45-48.
[6] Kong Lingling. Research on Chinese named entity recognition with limited annotated data[D]. Zhejiang: PhD dissertation, Zhejiang University, 2019.
[7] Collobert R, Weston J. A unified architecture for natural language processing: deep neural networks with multitask learning[C]//Proceedings of the 25th International Conference on Machine Learning, 2008: 160-167.
[8] Yang Z, Salakhutdinov R, Cohen W W, et al. Transfer learning for sequence tagging with hierarchical recurrent networks[C]//Proceedings of the 5th International Conference on Learning Representations, 2017.
[9] Lin B Y, Lu W. Neural adaptation layers for cross-domain named entity recognition[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018: 2012-2022.
[10] Lee J Y, Dernoncourt F, Szolovits P. Transfer learning for named-entity recognition with neural networks[C]//Proceedings of the 11th International Conference on Language Resources and Evaluation, 2018.
[11] Liu Z, Zhu C, Zhao T. Chinese named entity recognition with a sequence labeling approach: based on characters, or based on words?[C]//Proceedings of the International Conference on Intelligent Computing. Springer, Berlin, Heidelberg, 2010: 634-640.
[12] Ruder S, Plank B. Strong baselines for neural semi-supervised learning under domain shift[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 1044-1054.
[13] Yang Y, Chen W, Li Z, et al. Distantly supervised NER with partial annotation learning and reinforcement learning[C]//Proceedings of the 27th International Conference on Computational Linguistics, 2018: 2159-2169.
[14] Greenberg N, Bansal T, Verga P, et al. Marginal likelihood training of BiLSTM-CRF for biomedical named entity recognition from disjoint label sets[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018: 2824-2829.
[15] Liu Y, Zhang Y, Che W, et al. Domain adaptation for CRF-based Chinese word segmentation using free annotations[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014: 864-874.
[16] Zhu Yun, Li Zhenghua, Huang Depeng, et al. Domain adaptation for Chinese word segmentation based on weakly labeled data[J]. Journal of Chinese Information Processing, 2019, 33(09): 1-8.
[17] Li S, Zhao Z, Hu R, et al.Analogical reasoning on Chinese morphological and semantic relations[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 138-143.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural computation, 1997, 9(8): 1735-1780.
[19] Tsuboi Y, Kashima H, Mori S, et al. Training conditional random fields using incomplete annotations[C]//Proceedings of the 22nd International Conference on Computational Linguistics, 2008: 897-904.
[20] Levow G A. The third international Chinese language processing bakeoff: word segmentation and named entity recognition[C]//Proceedings of the 5th SIGHAN Workshop on Chinese Language Processing, 2006: 108-117.
[21] Peng N, Dredze M. Named entity recognition for Chinese social media with jointly trained embeddings[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015: 548-554.
[22] Zhang Y, Yang J. Chinese NER using lattice LSTM[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 1554-1564.
[23] Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018: 2227-2237.
[24] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019: 4171-4186.
[25] Wolf T, Debut L, Sanh V, et al. HuggingFace's Transformers: State-of-the-art natural language processing[J]. arXiv preprint arXiv:1910.03771, 2019.
[26] Jia C, Liang X, Zhang Y. Cross-domain NER using cross-domain language modeling[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2464-2474.
Funding
National Natural Science Foundation of China (61525205, 61876115)