SHU Lei, GUO Yiluan, WANG Huiping, ZHANG Xuetao, HU Renfen. The Construction and Application of Ancient Chinese Corpus with Word Sense Annotation[J]. Journal of Chinese Information Processing, 2022, 36(5): 21-30.
The Construction and Application of Ancient Chinese Corpus with Word Sense Annotation
SHU Lei1,3, GUO Yiluan2,3, WANG Huiping1,3, ZHANG Xuetao1,2, HU Renfen1,3
1. Institute of Chinese Information Processing, Beijing Normal University, Beijing 100875, China; 2. Institute for Advanced Study of the Humanities and Religion, Beijing Normal University, Beijing 100875, China; 3. College of Chinese Language and Culture, Beijing Normal University, Beijing 100875, China
Abstract: Because monosyllabic words dominate ancient Chinese, polysemy makes the language difficult for modern readers to understand. Drawing on the linguistic knowledge in traditional dictionaries, this paper designs principles for dividing the senses of polysemous words in ancient Chinese and organizes knowledge about commonly used monosyllabic words. Following these guidelines, the annotated corpus has grown to 38 700 sentences containing more than 1 176 000 Chinese characters. Experiments show that a BERT-based word sense disambiguation model trained on the corpus reaches an accuracy of about 80%. The paper further explores how the corpus and the word sense disambiguation technique can be applied to studies of the language itself and to dictionary compilation, through diachronic analysis of word meaning evolution and the induction of sense families.