Journal of Chinese Information Processing, 2014, Vol. 28, Issue (6): 18-25.
Lexical, Syntactic and Semantic Analysis and Applications


Chinese Base-Chunk Identification Based on Distributed Character Representation

  • LI Guochen1, DANG Shuaibing2, WANG Ruibo3, LI Jihong3

Abstract

Chinese base-chunk identification is an important task in the automatic syntactic and semantic analysis of Chinese. A widely used strategy transforms it into a word-level sequence labeling problem and handles it with models such as CRFs. Although this approach has achieved the best results in many open evaluations, its practical application is limited by the accuracy of automatic Chinese word segmentation systems and by the sparsity of Chinese word features. This paper therefore presents a base-chunk identification model based on deep neural networks that takes the Chinese character as both the tagging unit and the original input layer. Two kinds of character-level distributed representations, C&W and word2vec, are learned by unsupervised methods and used as the initial parameters of the network's character representation layer to strengthen model training. Experimental results show that, with a five-layer neural network and a character window of [-3, 3] over word2vec distributed representations, the precision, recall and F-measure reach 80.74%, 73.80% and 77.12%, respectively, about 5 points higher than a character-based CRF. This indicates that deep neural network models are effective for Chinese base-chunk identification.
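The character-window architecture the abstract describes can be sketched as follows. This is an illustrative toy only, not the authors' implementation: random vectors stand in for the pretrained word2vec/C&W character embeddings, a single hidden layer stands in for the five-layer network, and all names and sizes are assumptions.

```python
import numpy as np

# Toy sketch: character embeddings in a [-3, 3] window feed a small
# feedforward tagger that scores B/I/O chunk tags for one position.
rng = np.random.default_rng(0)

sentence = list("汉语基本块识别")
vocab = {c: i for i, c in enumerate(sorted(set(sentence)))}
PAD = len(vocab)                       # padding id for window edges

emb_dim, win, n_tags = 8, 3, 3         # window [-3, 3]; tags B/I/O
E = rng.normal(size=(len(vocab) + 1, emb_dim))   # character embedding table
                                                 # (pretrained in the paper)

def window_ids(ids, i, w=win):
    """Ids of the characters in the [-w, w] window around position i."""
    padded = [PAD] * w + ids + [PAD] * w
    return padded[i : i + 2 * w + 1]

ids = [vocab[c] for c in sentence]
x = E[window_ids(ids, 0)].reshape(-1)  # concatenated window embeddings

# One hidden layer plus a softmax output; the paper stacks more layers.
W1 = rng.normal(size=(x.size, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, n_tags));  b2 = np.zeros(n_tags)

h = np.tanh(x @ W1 + b1)
logits = h @ W2 + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # tag distribution for character 0
```

Training would update both the tagger weights and the embedding table E, which is the sense in which the pretrained representations serve only as initial parameters.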

Key words

Chinese base-chunk / distributed representation / deep neural network / sequence labeling

Cite this article

LI Guochen, DANG Shuaibing, WANG Ruibo, LI Jihong. Chinese Base-Chunk Identification Based on Distributed Character Representation. Journal of Chinese Information Processing, 2014, 28(6): 18-25.


Funding

National Natural Science Foundation of China (60873128); Shanxi Province Science and Technology Basic Conditions Platform Construction Project (2013091003-0101)