Chinese Word-Formation Prediction Based on Lexical Level Embedding

ZHENG Hua, LIU Yang, YIN Yaqi, WANG Yue, DAI Damai

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 5: 31-40, 66.
Special topic: CCL 2021 Outstanding Papers

Abstract

Chinese is a paratactic language, and its word-formation structures describe how the components of a word combine, which makes them key to recognizing and understanding word meaning. In Chinese information processing, most previous work on word-formation prediction has adopted coarse-grained labels borrowed from syntax and has modeled mainly inter-word information such as context, overlooking the role of intra-word information such as morpheme meaning and word meaning. This paper adopts a word-formation label scheme defined from a linguistic perspective, constructs a dataset of Chinese word-formation structures and related information, and proposes a model based on Bi-LSTM and self-attention to examine how intra-word and inter-word information affect word-formation prediction and what performance can be reached. The model reaches an accuracy of 77.87% and an F1 score of 78.36%. Comparative tests further show that intra-word morpheme-meaning information contributes significantly to word-formation prediction, whereas inter-word context information contributes little and is markedly unstable.

Key words

Chinese word-formation / word features / morphemes

Cite this article

ZHENG Hua, LIU Yang, YIN Yaqi, WANG Yue, DAI Damai. Chinese Word-Formation Prediction Based on Lexical Level Embedding. Journal of Chinese Information Processing. 2022, 36(5): 31-40, 66.

Funding

National Natural Science Foundation of China (62036001); National Social Science Fund of China (18ZDA295)