郑婳,刘扬,殷雅琦,王悦,代达劢. 基于词信息嵌入的汉语构词结构识别研究[J]. 中文信息学报, 2022, 36(5): 31-40,66.
ZHENG Hua, LIU Yang, YIN Yaqi, WANG Yue, DAI Damai. Chinese Word-Formation Prediction Based on Lexical Level Embedding[J]. Journal of Chinese Information Processing, 2022, 36(5): 31-40,66.
Abstract: Chinese is a paratactic language, and its word-formations specify how formation components combine into words, which makes them key to understanding semantics. In Chinese natural language processing, most existing work on word-formation prediction adopts coarse-grained syntactic labels and relies on inter-word features from the context, ignoring inner-word features such as morphemes and lexical semantics. In this paper, we adopt word-formation labels defined from a linguistic perspective and construct a formation-informed Chinese dataset. We then propose a Bi-LSTM-based model with self-attention to explore how inner- and inter-word features influence Chinese word-formation prediction. Experimental results show that our method achieves high accuracy (77.87%) and F1 score (78.36%) on the word-formation prediction task. Comparative analyses further show that morphemes (an inner-word feature) greatly improve prediction, whereas context (an inter-word feature) performs worst and is highly unstable.
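The abstract reports accuracy and F1 without saying how F1 is averaged. In single-label multi-class classification, micro-averaged F1 equals accuracy, so an F1 that differs from accuracy is presumably macro- or weighted-averaged. The sketch below shows how the three quantities could be computed. It is a minimal illustration rather than the authors' evaluation script: the function names and the toy formation labels are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of words whose predicted formation label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_scores(gold, pred):
    """Return (macro_f1, weighted_f1) over the labels that occur in gold."""
    support = Counter(gold)  # number of gold instances per label
    per_label = {}
    for label in support:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        per_label[label] = 2 * precision * recall / denom if denom else 0.0
    # Macro: every label counts equally. Weighted: labels count by support.
    macro = sum(per_label.values()) / len(per_label)
    weighted = sum(per_label[l] * support[l] for l in per_label) / len(gold)
    return macro, weighted


# Toy data with hypothetical label names (assumption, for illustration only).
gold = ["modifier-head", "modifier-head", "coordinate",
        "verb-object", "coordinate", "modifier-head"]
pred = ["modifier-head", "coordinate", "coordinate",
        "verb-object", "modifier-head", "modifier-head"]

acc = accuracy(gold, pred)
macro_f1, weighted_f1 = f1_scores(gold, pred)
print(f"accuracy={acc:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

On this toy input the macro-averaged F1 comes out higher than accuracy, because the rare label that is predicted perfectly counts as much as the frequent ones. That is one way an F1 score can exceed accuracy, as it does in the figures reported above.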