Abstract
New word detection, a fundamental task in natural language processing, is an indispensable step in the computational study of ancient Chinese literature. This paper proposes a new word detection method for ancient Chinese corpora, called the AP-LSTM-CRF algorithm, which consists of three steps. First, a parallelized improved Apriori algorithm, implemented on Apache Spark (a distributed parallel computing framework), efficiently generates a set of candidate words from a large-scale raw corpus. Second, a segmentation probability model that combines a recurrent neural network with a conditional random field segments the sentences of the test-set documents and produces sequences of segmentation probabilities. Third, filter rules that incorporate these segmentation probabilities remove noise words from the candidate set, leaving the true new words. Experimental results show that the method effectively detects new words in large-scale ancient Chinese corpora: on the Song Poetry and History of the Song Dynasty datasets, the F1 score reaches 89.68% and 81.13%, respectively, improvements of 8.66% and 2.21% over existing methods.
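The first and third steps can be illustrated with a small single-machine sketch. It is not the paper's implementation: it uses plain Python rather than Apache Spark, the function names `frequent_ngrams` and `filter_candidates` are hypothetical, the cut probabilities are supplied by hand as a stand-in for the output of the LSTM-CRF segmentation model, and the threshold rule is an assumed example of a probability-based filter, since the abstract does not spell out the actual filter rules.

```python
from collections import Counter


def frequent_ngrams(sentences, min_count=2, max_len=4):
    """Level-wise (Apriori-style) mining of frequent character n-grams."""
    # Level 1: frequent single characters.
    counts = Counter(ch for s in sentences for ch in s)
    frequent = {g: c for g, c in counts.items() if c >= min_count}
    result = {}
    for k in range(2, max_len + 1):
        counts = Counter()
        for s in sentences:
            for i in range(len(s) - k + 1):
                gram = s[i:i + k]
                # Apriori pruning: count a k-gram only if both of its
                # (k-1)-length sub-grams were frequent at the previous level.
                if gram[:-1] in frequent and gram[1:] in frequent:
                    counts[gram] += 1
        frequent = {g: c for g, c in counts.items() if c >= min_count}
        if not frequent:
            break
        result.update(frequent)
    return result


def filter_candidates(candidates, sentences, cut_probs, threshold=0.5):
    """Keep candidates whose occurrences look like whole words.

    cut_probs[i][j] is an assumed probability of a word boundary
    immediately after character j of sentence i.
    """
    kept = []
    for cand in candidates:
        edge, inner = [], []
        for s, probs in zip(sentences, cut_probs):
            start = s.find(cand)
            while start != -1:
                end = start + len(cand) - 1  # index of the last character
                if start > 0:
                    edge.append(probs[start - 1])  # boundary before the candidate
                edge.append(probs[end])            # boundary after the candidate
                inner.extend(probs[start:end])     # boundaries inside the candidate
                start = s.find(cand, start + 1)
        if not edge or not inner:
            continue
        # Assumed rule: likely cuts at the edges, unlikely cuts inside.
        if sum(edge) / len(edge) >= threshold and sum(inner) / len(inner) < threshold:
            kept.append(cand)
    return kept


# Toy data: three short lines and hand-written cut probabilities.
sentences = ["明月几时有", "明月照高楼", "清风明月"]
cut_probs = [
    [0.1, 0.9, 0.8, 0.2, 1.0],
    [0.2, 0.8, 0.9, 0.1, 1.0],
    [0.1, 0.9, 0.1, 1.0],
]
candidates = frequent_ngrams(sentences)
new_words = filter_candidates(candidates, sentences, cut_probs)
print(candidates, new_words)
```

The pruning check is what makes the mining Apriori-like: a longer sequence is counted only when its shorter sub-sequences are already known to be frequent, which keeps the candidate set small before any filtering happens.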
Key words
improved-Apriori algorithm /
long short-term memory networks /
conditional random field /
filter rules /
parallelization
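Funding
National Key Basic Research Program of China (973 Program) (2013CB329606); National Natural Science Foundation of China (61772082); National Social Science Fund of China (16ZDA055)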