Abstract
Based on a diachronic corpus of modern Chinese newspapers spanning 70 years, this paper applies nine statistical measures to the year-by-year usage of words and, by examining stability, corpus coverage, and temporal discriminability, selects a candidate set of 3 013 diachronic steady-state words. Verbs and nouns each account for about one third of the set; the rest consists of adjectives, adverbs, and function words. The average word length is about 1.7 characters. The candidates fall within the top 7 609 ranks of the overall frequency list of the diachronic corpus, densely at the head and sparsely toward the tail, and cover nearly 90% of the corpus. The set contains many core words that build sentence structure, and these shape its characteristics in word length and part of speech. Extracting steady-state words deepens the understanding of the base layer of language life and of the basic vocabulary, and is significant for Chinese language teaching, Chinese information processing, and language planning.
Key words
steady-state word /
diachronic corpus /
language monitoring
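Funding
National Social Science Fund of China (12&ZD173); research projects of the State Language Commission (YB125-42, ZDI135-3); Key Project of the 863 Program (SQ2015AA0100074); National Social Science Fund of China (16AYY007); Major Project of the Key Research Bases of Humanities and Social Sciences, Ministry of Education (16JJD740004)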