Abstract
Automatically extracting ancient (classical) Chinese text benefits language monitoring and corpus construction. This paper treats the tagging of ancient Chinese passages in a mixed corpus as a short-text classification task and applies both rule-based and statistical methods to separate ancient Chinese from vernacular text. The rule-based methods take into account the function words and constructions commonly used in ancient Chinese. For the statistical methods, we run experiments with N-gram, Naive Bayes, Maximum Entropy, and Decision Tree models. The results show that a unigram language model that monitors the function-word system outperforms the others, reaching an F-score of 0.98. The computational results also support the conclusion, reached introspectively in the linguistics community, that the classical and vernacular systems of Chinese form a continuum.
Key words: ancient Chinese tagging; text classification; rule-based model; statistics-based model
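The unigram idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes character-level unigrams, add-one smoothing, and a handful of toy training sentences, all of which are hypothetical choices made for illustration. The names `train_unigram`, `log_likelihood`, `classify`, and the labels `"classical"` and `"vernacular"` are likewise invented here. A version closer to the abstract's description would presumably restrict the counts to a list of function words, which the abstract does not give.

```python
import math
from collections import Counter

def train_unigram(texts):
    """Count character frequencies over a list of training strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)  # a str is iterated character by character
    return counts

def log_likelihood(text, counts, vocab_size):
    """Sum of add-one-smoothed log-probabilities of the characters in `text`."""
    total = sum(counts.values())
    return sum(
        math.log((counts[ch] + 1) / (total + vocab_size))
        for ch in text
    )

def classify(text, models):
    """Return the label whose unigram model gives `text` the highest score."""
    vocab = set().union(*(m.keys() for m in models.values()))
    vocab_size = len(vocab) + 1  # one extra slot for unseen characters
    return max(
        models,
        key=lambda label: log_likelihood(text, models[label], vocab_size),
    )

# Toy training data (hypothetical, far too small for real use).
models = {
    "classical": train_unigram([
        "学而时习之不亦说乎",
        "吾日三省吾身",
        "有朋自远方来不亦乐乎",
        "温故而知新可以为师矣",
    ]),
    "vernacular": train_unigram([
        "我们今天去学校上课",
        "他的朋友从很远的地方来了",
        "这是一个很好的办法",
        "你们在做什么呢",
    ]),
}

print(classify("不亦君子乎", models))
print(classify("我们的学校很好", models))
```

With these toy sentences, characters such as 不, 亦, and 乎 occur only in the classical sample and 的, 们, and 很 only in the vernacular one, so each test string is pulled toward the class whose frequent characters it shares. A real system would need a far larger labeled corpus and a held-out set to measure the F-score.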
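Funding
National Natural Science Foundation of China (61300081, 61170162); National High Technology Research and Development Program of China (2015AA015409); Major Program of the National Social Science Foundation of China (12&ZD173)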