饶高琦;李宇明. 基于70年报刊语料的现代汉语历时稳态词抽取与考察[J]. 中文信息学报, 2016, 30(6): 49-58.
RAO Gaoqi; LI Yuming. Extraction and Investigation of State Steady Words from 70 Years Newspapers[J]. Journal of Chinese Information Processing, 2016, 30(6): 49-58.
Extraction and Investigation of State Steady Words from 70 Years Newspapers
RAO Gaoqi^1; LI Yuming^2
1. Center for Studies of Chinese as a Second Language, Beijing Language and Culture University, Beijing 100083, China;
2. Institute for Chinese Language Policies and Standards, Beijing Language and Culture University, Beijing 100083, China
Abstract: Using a diachronic corpus of modern Chinese newspapers spanning 70 years, this study applies statistical measures to detect state-steady words. In total, 3,013 words are selected as candidates according to their corpus coverage, time sensitivity, and diachronic classification. Verbs and nouns each account for about one third of the candidates, and the rest are adjectives and function words. The average word length is 1.7 characters. The candidates all fall within the top 7,609 entries of the frequency list and cover 90% of the corpus. Basic morphemes and core words shape the part-of-speech and length characteristics of the set.