Multi-Head Attention for Domain Specific Chinese Word Segmentation Model — A Case Study on Tibet’s Animal Husbandry

CUI Zhiyuan, ZHAO Erping, LUO Weiqun, WANG Wei, SUN Hao

Journal of Chinese Information Processing (中文信息学报) ›› 2021, Vol. 35 ›› Issue (7): 72-80.
Section: Information Processing of Minority, Cross-border and Neighboring Languages


Abstract

Specialized-domain corpora typically contain more out-of-vocabulary words than general-domain corpora; the Tibetan animal husbandry corpus, for example, includes many directly transliterated or compounded person names, place names, livestock names, and forage names, and these unknown words are the main cause of low segmentation accuracy. To address this problem, this paper proposes a multi-head attention Chinese word segmentation model for specialized domains. The model takes sentences represented as character vectors as input and uses a bidirectional gated recurrent unit (BiGRU) network together with multi-head attention to learn the contextual semantic features of the character vectors and the relations among them. To make the model attend to the dependencies among key character vectors and to segmentation-point information, the multi-head attention mechanism computes, in parallel and regardless of the distance between positions, the correlation between important character vectors and all other character vectors, weighting the contribution of important characters to the model. A conditional random field (CRF) layer then learns word-position tags and outputs the optimal segmentation sequence. Finally, a domain dictionary is constructed to further improve segmentation. Experiments on a Tibetan animal husbandry corpus show that, compared with classical models such as BiLSTM-CRF, the proposed model improves precision, recall, and F1 by 3.93%, 5.3%, and 3.63% respectively, effectively improving segmentation of Tibetan animal husbandry text.
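The multi-head attention step described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the learned query/key/value projections are replaced by random matrices, and layer details such as masking and output projection are omitted.

```python
import numpy as np

def multi_head_self_attention(X, num_heads, rng=None):
    """Scaled dot-product self-attention with several heads computed in parallel.

    X is a (seq_len, d_model) matrix of character vectors; the output has the
    same shape. Every position attends to every other position, so the
    correlation between two characters is computed regardless of their distance.
    """
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must divide evenly into heads"
    d_k = d_model // num_heads
    rng = np.random.default_rng(0) if rng is None else rng
    # Random matrices stand in for the learned query/key/value projections.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        s = slice(h * d_k, (h + 1) * d_k)
        # Pairwise correlation scores between all positions for this head.
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        # Softmax over positions: each row becomes attention weights summing to 1.
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        heads.append(w @ V[:, s])
    # Concatenating the heads restores the original model dimension.
    return np.concatenate(heads, axis=1)
```

In the paper's pipeline, the inputs at this stage would be the BiGRU's contextualized character vectors, and the attended output would feed the CRF layer.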

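The word-position tags learned by the CRF and the domain-dictionary post-processing can also be illustrated concretely. The sketch below assumes the common B/M/E/S tag set and a simple greedy longest-match merge; the example sentence and lexicon entries are hypothetical, and the paper's actual tag set and dictionary procedure may differ.

```python
def bmes_to_words(chars, tags):
    """Recover words from per-character B/M/E/S word-position tags,
    the kind of label sequence a CRF layer outputs."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":                 # single-character word
            if buf:
                words.append(buf)
                buf = ""
            words.append(ch)
        elif tag == "B":               # begin a multi-character word
            if buf:
                words.append(buf)
            buf = ch
        elif tag == "M":               # middle of the current word
            buf += ch
        else:                          # "E": end of the current word
            words.append(buf + ch)
            buf = ""
    if buf:                            # flush a dangling (ill-formed) word
        words.append(buf)
    return words

def merge_by_dictionary(words, lexicon, max_span=4):
    """Greedily merge adjacent segmenter output when the concatenation is a
    known domain term, preferring the longest match."""
    out, i = [], 0
    while i < len(words):
        for j in range(min(len(words), i + max_span), i + 1, -1):
            cand = "".join(words[i:j])
            if cand in lexicon:
                out.append(cand)
                i = j
                break
        else:                          # no merge found: keep the word as-is
            out.append(words[i])
            i += 1
    return out
```

For example, `bmes_to_words(list("牦牛吃牧草"), ["B", "E", "S", "B", "E"])` returns `["牦牛", "吃", "牧草"]`, and a domain lexicon containing a transliterated term can then rejoin fragments the base segmenter split apart.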
Key words

Chinese word segmentation / Multi-Head Attention / BiGRU / Tibetan animal husbandry corpus
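The precision, recall, and F1 figures reported in the abstract are the standard word-level segmentation metrics, which score each predicted word as a character span against the gold segmentation. A minimal sketch (the example sentence is hypothetical):

```python
def segmentation_prf(gold_words, pred_words):
    """Word-level precision/recall/F1 for Chinese word segmentation.

    Both segmentations cover the same character sequence; a predicted word
    counts as correct only if its character span matches a gold word exactly.
    """
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

Splitting one gold word in two, for instance, lowers both precision (extra wrong spans) and recall (the gold span is missed), which is why unknown domain terms hurt both measures at once.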

Cite This Article

CUI Zhiyuan, ZHAO Erping, LUO Weiqun, WANG Wei, SUN Hao. Multi-Head Attention for Domain Specific Chinese Word Segmentation Model — A Case Study on Tibet’s Animal Husbandry. Journal of Chinese Information Processing. 2021, 35(7): 72-80


Funding

National Natural Science Foundation of China (61762082); Natural Science Foundation of the Tibet Autonomous Region (XZ2018ZRG-66); Science and Technology Program of the Tibet Autonomous Region (XZ202001ZY0055G)