YU Ningyi, RAO Gaoqi, XUN Endong. A Tentative Study on Statistical and Rule Based Information Extraction from Ancient Chinese[J]. Journal of Chinese Information Processing, 2015, 29(6): 127-134.
A Tentative Study on Statistical and Rule Based Information Extraction from Ancient Chinese
YU Ningyi², RAO Gaoqi¹,², XUN Endong²
(1. Faculty of Language Sciences, Beijing Language and Culture University, Beijing 100083, China;
2. College of Information Sciences, Beijing Language and Culture University, Beijing 100083, China)
Abstract: Information extraction from ancient Chinese benefits language monitoring and corpus construction. This paper treats the tagging of ancient Chinese in a mixed corpus as a short-text classification task and applies both rule-based and statistical methods. For the rule-based methods, the paper considers the effect of function words and constructions in ancient Chinese. For the statistical methods, we conduct experiments with N-gram, Naive Bayes, Maximum Entropy, and Decision Tree models. The experiments indicate that the unigram model outperforms the others, with an F-score of 0.98. The research also provides evidence for the view that the evolution of Chinese is a continuum.
Key words: ancient Chinese tagging; text classification; rule-based model; statistical model
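
To make the idea of unigram-based short-text classification concrete, the following is a minimal sketch, not the authors' implementation. It assumes character-level unigrams, add-one smoothing, and no class prior; the class name `UnigramClassifier`, the labels `"ancient"` and `"modern"`, and the six toy training sentences are all hypothetical illustrations rather than the paper's data, features, or toolkit settings.

```python
import math
from collections import Counter


class UnigramClassifier:
    """One character-unigram model per class, with add-one smoothing.

    Hypothetical illustration: the smoothing scheme and the use of
    characters (rather than words) are assumptions, not taken from the paper.
    """

    def __init__(self):
        self.counts = {}    # label -> Counter of character frequencies
        self.totals = {}    # label -> total number of characters seen
        self.vocab = set()  # all characters seen in any class

    def fit(self, samples):
        """samples: iterable of (text, label) pairs."""
        for text, label in samples:
            self.counts.setdefault(label, Counter()).update(text)
            self.vocab.update(text)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def log_likelihood(self, text, label):
        """Sum of smoothed log P(char | label) over the characters of text."""
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        counter, total = self.counts[label], self.totals[label]
        return sum(math.log((counter[ch] + 1) / (total + v)) for ch in text)

    def predict(self, text):
        """Return the label whose unigram model gives text the highest likelihood."""
        return max(self.counts, key=lambda lab: self.log_likelihood(text, lab))


# Toy training data, invented for illustration only.
TRAIN = [
    ("学而时习之不亦说乎", "ancient"),
    ("有朋自远方来不亦乐乎", "ancient"),
    ("吾日三省吾身", "ancient"),
    ("我们今天去学校上课", "modern"),
    ("这个问题我们已经讨论过了", "modern"),
    ("他的朋友们都在北京工作", "modern"),
]

clf = UnigramClassifier().fit(TRAIN)
print(clf.predict("不亦君子乎"))      # characters frequent in the "ancient" samples
print(clf.predict("我们在学校工作"))  # characters frequent in the "modern" samples
```

With such a small training set the sketch only shows the mechanics: each class keeps its own character frequency table, and a short text is assigned to the class under which its characters are jointly most probable.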