Language Analysis and Calculation
LIU Yutong, WU Bin, XIE Tao, WANG Bai
2019, 33(1): 46-55.
New word detection, a fundamental task in natural language processing, is an indispensable step in the computational study of ancient Chinese literature. In this work, we present an AP-LSTM-CRF model to discover new words in ancient Chinese literature. The model works in three steps. First, a parallelized improved Apriori algorithm, implemented on Apache Spark (a distributed parallel computing framework), generates candidate character sequences from a large-scale raw corpus. Second, a segmentation model that combines a recurrent neural network with a conditional random field produces segmentation sequences with associated probabilities. Third, a rule-based filter removes noise words from the candidate character sequences. Experimental results demonstrate that the method detects new words effectively in large-scale ancient Chinese corpora. The F1 score reaches 89.68% on the Song Poetry dataset and 81.13% on the History of the Song Dynasty dataset.
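The first step, candidate generation, can be illustrated with a minimal single-machine sketch of Apriori-style pruning over character n-grams. This is not the authors' improved or Spark-parallelized algorithm, which the abstract does not detail. The function name `frequent_char_sequences`, the parameters `min_support` and `max_len`, and the sample lines are assumptions made for illustration only.

```python
from collections import Counter


def frequent_char_sequences(lines, min_support=2, max_len=4):
    """Return {sequence: count} for character n-grams seen at least min_support times.

    Apriori pruning: an n-gram is counted only if both of its (n-1)-gram
    sub-sequences (its prefix and its suffix) were frequent in the previous pass.
    min_support and max_len are hypothetical parameters, not values from the paper.
    """
    frequent = {}
    prev = None  # frequent (n-1)-grams from the previous pass
    for n in range(1, max_len + 1):
        counts = Counter()
        for line in lines:
            for i in range(len(line) - n + 1):
                gram = line[i:i + n]
                # On the first pass there is nothing to prune against.
                if prev is None or (gram[:-1] in prev and gram[1:] in prev):
                    counts[gram] += 1
        current = {g: c for g, c in counts.items() if c >= min_support}
        if not current:
            break  # no frequent n-grams, so no longer sequence can be frequent
        frequent.update(current)
        prev = current
    return frequent


# Illustrative input only; these lines are not drawn from the paper's datasets.
sample = ["明月几时有", "明月照高楼", "举头望明月"]
result = frequent_char_sequences(sample)
# Keep multi-character sequences as new-word candidates.
candidates = {g for g in result if len(g) >= 2}
print(candidates)  # {'明月'}
```

In a pipeline like the one described, such candidates would then be passed to the segmentation model and the rule-based filter. In a distributed setting, the per-line counting would be the natural part to spread across workers.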