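This paper collects a continuous diachronic corpus of newspapers and periodicals spanning 144 years, from the late Qing Dynasty to the 21st century. It studies the diachronic evolution of Chinese word meanings with two families of methods, statistical analysis and distributed word representation, to compute and help identify such changes. Words whose meaning has evolved are mined using several statistical indicators, including TF-IDF and word frequency ratio, together with the content words that co-occur with a target word in a passage and the degree to which those words overlap across periods. To align word vectors from different time periods of the diachronic corpus, three methods are used: SGNS training followed by orthogonal-matrix projection, SGNS incremental training, and second-order word vector representation based on "anchor words"; SGNS incremental training works best. The automatically discovered cases of semantic evolution are analyzed using the diachronic self-similarity of the target word and its diachronic similarity to anchor words, and nearest-neighbor words are used to clarify the meaning of the target word before and after the change.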
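As an illustration of the co-occurrence overlap indicator mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes pre-segmented sentences, a fixed context window, a caller-supplied stopword set standing in for content-word filtering, and Jaccard overlap of the top-k context words as the overlap measure; the function names `context_words` and `context_overlap`, the parameters `window` and `top_k`, and the toy tokens are all hypothetical.

```python
from collections import Counter


def context_words(sentences, target, window=5, stopwords=frozenset()):
    """Count the words that co-occur with `target` within +/- `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                # Skip the target position itself and any stopword (assumed filter).
                if j != i and tokens[j] not in stopwords:
                    counts[tokens[j]] += 1
    return counts


def context_overlap(counts_a, counts_b, top_k=50):
    """Jaccard overlap between the top-k context words of two time periods."""
    top_a = {w for w, _ in counts_a.most_common(top_k)}
    top_b = {w for w, _ in counts_b.most_common(top_k)}
    if not top_a or not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


# Toy, made-up sentences for two periods (not taken from the paper's corpus).
period_a = [["the", "mouse", "ate", "cheese"], ["cat", "chased", "mouse"]]
period_b = [["click", "mouse", "button"], ["mouse", "cursor", "screen"]]
stop = frozenset({"the"})

ctx_a = context_words(period_a, "mouse", stopwords=stop)
ctx_b = context_words(period_b, "mouse", stopwords=stop)
overlap = context_overlap(ctx_a, ctx_b)
```

A low overlap between two periods flags the target word as a candidate for closer inspection; on its own it does not establish that the meaning changed.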
Abstract
This paper collects a continuous diachronic corpus of Chinese newspapers and periodicals spanning 144 years, from the late Qing Dynasty to the 21st century. Using statistical analysis and distributed word representations, it studies how to compute and identify the semantic evolution of Chinese words. Words that may have undergone semantic evolution are first identified with statistical indicators such as TF-IDF and word frequency ratio, together with the overlap of the content words that co-occur with a target word in context. Then, to align the word embeddings derived from the corpora of different time periods, three methods are examined: orthogonal-matrix projection after SGNS training, second-order word vector representation based on anchor words, and SGNS incremental training, which performs best. Finally, the discovered cases of semantic evolution are analyzed using the diachronic self-similarity of the target word and its diachronic similarity to anchor words, with neighboring words used to describe the word's meaning before and after the change.
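The second-order representation and the diachronic self-similarity described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: each period's embeddings are assumed to be a plain dictionary from word to vector, the anchor words are assumed to be given and stable in meaning, and the names `second_order` and `diachronic_self_similarity` as well as the toy vectors are hypothetical. The idea is that a word is described by its cosine similarities to the anchor words within its own period, so vectors from independently trained spaces become comparable without a projection step.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def second_order(word, embeddings, anchors):
    """Describe `word` by its similarity to each anchor within one period's space."""
    return [cosine(embeddings[word], embeddings[a]) for a in anchors]


def diachronic_self_similarity(word, emb_early, emb_late, anchors):
    """Compare a word with itself across two periods via anchor-based vectors."""
    return cosine(second_order(word, emb_early, anchors),
                  second_order(word, emb_late, anchors))


# Toy 2-D spaces; the late space is the early one rotated by 90 degrees,
# mimicking two independently trained models. All values are made up.
anchors = ["a1", "a2"]
emb_early = {"a1": [1.0, 0.0], "a2": [0.0, 1.0],
             "stable": [0.0, 1.0], "shifted": [1.0, 0.0]}
emb_late = {"a1": [0.0, 1.0], "a2": [-1.0, 0.0],
            "stable": [-1.0, 0.0], "shifted": [-1.0, 0.0]}

sim_stable = diachronic_self_similarity("stable", emb_early, emb_late, anchors)
sim_shifted = diachronic_self_similarity("shifted", emb_early, emb_late, anchors)
```

In this toy setup the raw vectors of `stable` differ between the two spaces only because of the rotation, yet its anchor-based self-similarity stays high, while `shifted`, which moves from near `a1` to near `a2`, scores low.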
Keywords (Chinese)
词义演变 (word semantic evolution) /
历时语料 (diachronic corpus) /
分布式表示 (distributed representation)
Key words
word semantic evolution /
diachronic corpus /
distributed representation
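Funding
Humanities and Social Sciences Fund of the Ministry of Education (20YJC740050); Beijing Language and Culture University Young Talents Training Program (1090/501321102); Fundamental Research Funds for the Central Universities, Beijing Language and Culture University (19YJ130005)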