Abstract
Based on an analysis of the characteristics of place names in Chinese text, this paper presents a method for the automatic recognition of Chinese place names that combines support vector machines (SVMs) with rules. First, the attributes of the feature vectors are extracted character by character and converted into binary vectors to build a training set; a machine learning model for place name recognition is then trained with a polynomial kernel function. Second, based on an analysis of the recognition errors, a rule base is constructed and applied as a post-processing step, which compensates for the low recall caused by the incomplete knowledge acquired by the machine learning model. Experiments show that combining SVMs with rules is effective for recognizing place names in Chinese text: in the open test, recall, precision, and F-measure reach 89.57%, 93.52%, and 91.50%, respectively.
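The character-level encoding and the polynomial kernel described in the abstract can be illustrated with a minimal sketch. The abstract does not list the actual attributes, the window size, or the kernel parameters, so everything here is an assumption for illustration only: the two-character context window, the `<PAD>` token, the helper names (`window_features`, `build_index`, `to_binary`, `poly_kernel`, `f_measure`), and the kernel settings `degree=2`, `coef0=1.0`. The rule base is not reproduced. The last function checks that the reported F-measure is the harmonic mean of the reported precision and recall.

```python
def window_features(text, i, size=2):
    """Return (offset, char) pairs for the characters around position i."""
    feats = []
    for off in range(-size, size + 1):
        j = i + off
        ch = text[j] if 0 <= j < len(text) else "<PAD>"  # hypothetical padding token
        feats.append((off, ch))
    return feats


def build_index(texts, size=2):
    """Assign one binary dimension to every (offset, char) pair seen in the texts."""
    index = {}
    for text in texts:
        for i in range(len(text)):
            for feat in window_features(text, i, size):
                index.setdefault(feat, len(index))
    return index


def to_binary(feats, index):
    """Sparse binary vector, stored as the set of dimensions that equal 1."""
    return frozenset(index[f] for f in feats if f in index)


def poly_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree.

    For binary vectors the dot product is the number of shared active dimensions.
    """
    return (len(x & y) + coef0) ** degree


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


texts = ["大连市位于辽宁省"]
index = build_index(texts)
x = to_binary(window_features(texts[0], 0), index)  # window centred on "大"
y = to_binary(window_features(texts[0], 1), index)  # window centred on "连"

print(poly_kernel(x, x))                   # 5 shared dimensions: (5 + 1) ** 2 = 36.0
print(poly_kernel(x, y))                   # 1 shared dimension:  (1 + 1) ** 2 = 4.0
print(round(f_measure(93.52, 89.57), 2))   # 91.5, matching the reported 91.50%
```

Storing each binary vector as a set of active indices keeps the kernel computation cheap, since almost all dimensions are zero for any single character window.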
Key words
computer application /
Chinese information processing /
support vector machines /
Chinese place name recognition /
machine learning /
rule-based post-processing
Funding
Supported by the National Natural Science Foundation of China (60373095, 60373096)