Research on Chinese Word Segmentation for Patent Documents

ZHANG Guiping, LIU Dongsheng, YIN Baosheng, XU Lijun, MIAO Xuelei

Journal of Chinese Information Processing, 2010, Vol. 24, Issue (3): 112-117.
Review

Abstract

Based on the characteristics of patent documents, this paper presents a multi-strategy word segmentation approach that combines statistics and rules. The method exploits latent segmentation marks in the document and uses the context information of the text being segmented in a maximum-probability segmentation model; term prefix and suffix rules are then applied in post-processing. By making full use of both the global information obtained from a large-scale corpus and the context information of the text being segmented, the method effectively addresses the difficulty of identifying out-of-vocabulary words in patent segmentation. Experimental results show that the method achieves good results in both closed and open tests and also performs well at recognizing out-of-vocabulary words.
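
To make the maximum-probability step concrete, the following is a minimal sketch of generic unigram maximum-probability segmentation solved by dynamic programming. It is not the authors' implementation: the function name `max_prob_segment`, the parameters `max_len` and `unk_prob`, and the toy probabilities are all assumptions made for illustration, and the latent segmentation marks, context information, and affix-based post-processing described in the abstract are not modeled here.

```python
import math

def max_prob_segment(text, word_prob, max_len=4, unk_prob=1e-8):
    """Return the split of `text` that maximizes the product of word probabilities.

    word_prob: dict mapping known words to unigram probabilities (toy values here).
    unk_prob:  assumed probability for a single character missing from word_prob.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that split
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_prob:
                p = word_prob[word]
            elif len(word) == 1:
                p = unk_prob      # fall back to a single unknown character
            else:
                continue          # unknown multi-character strings are not candidates
            score = best[j] + math.log(p)
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Hypothetical probabilities, chosen only to exercise the function.
toy_probs = {
    "专利": 0.02, "文献": 0.02, "专利文献": 0.001,
    "分词": 0.01, "分": 0.005, "词": 0.005,
}
print(max_prob_segment("专利文献分词", toy_probs))
```

With these toy values the four-character entry is preferred over splitting it into two two-character words, because 0.001 × 0.01 is larger than 0.02 × 0.02 × 0.01. A real system would estimate the probabilities from a large corpus and would need a separate mechanism for out-of-vocabulary terms.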

Key words

computer application / Chinese information processing / Chinese word segmentation / patent document / context information
 

Cite this article

ZHANG Guiping, LIU Dongsheng, YIN Baosheng, XU Lijun, MIAO Xuelei. Research on Chinese Word Segmentation for Patent Documents. Journal of Chinese Information Processing. 2010, 24(3): 112-117

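Funding

National Natural Science Foundation of China (60842005); Science and Technology Research Project of the Liaoning Provincial Department of Education (2007T139)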