Abstract
Noun phrase identification is fundamental to natural language processing tasks such as syntactic analysis. Research on Lao noun phrase identification is still in its infancy. Compared with other languages, Lao poses several difficulties: fuzzy phrase boundaries, ambiguous definitions, a limited corpus, and excessively long sentences. To address these problems, this paper studies the structure of Lao noun phrases and builds a multi-channel Lao noun phrase identification model that incorporates that structure. The model combines character, word, and part-of-speech (POS) features into different input channels and uses multiple BiLSTM networks to extract more hidden information from different aspects, which also alleviates the problem of the many out-of-vocabulary noun phrases found in low-resource corpora. In addition, because Lao sentences are excessively long, the model introduces an Attention mechanism that increases the weights of important features and effectively reduces interference from useless information. Experimental results show that the model reaches an F1 value of 85.25% on a limited annotated corpus, outperforming the other models compared.
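As a rough illustration of how an attention step can raise the weight of important features in a long sentence, the sketch below scores each time step's feature vector against a query vector, normalizes the scores with a softmax, and returns the weighted sum. The dot-product scoring, the function name `attention_pool`, and the toy vectors are assumptions made for illustration; the abstract does not specify how the paper's Attention layer is computed.

```python
import math


def attention_pool(hidden_states, query):
    """Softmax-weighted sum of per-token feature vectors (hypothetical helper).

    hidden_states: list of equal-length float lists, one per token
                   (standing in for BiLSTM outputs).
    query:         float list of the same length (assumed scoring vector).
    """
    # Dot-product relevance score for every time step.
    scores = [sum(h * q for h, q in zip(h_t, query)) for h_t in hidden_states]

    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum: steps with higher weights contribute more to the context.
    dim = len(hidden_states[0])
    context = [
        sum(w * h_t[i] for w, h_t in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return weights, context


# Toy input: three token vectors; the third aligns best with the query.
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
query = [1.0, 0.0]
weights, context = attention_pool(states, query)
```

In this toy input the third token receives the largest weight, so tokens that carry little relevant information contribute less to the pooled representation.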
Key words
identification of noun phrases /
BiLSTM /
multi-channel /
Attention mechanism