QIAO Wei, SUN Mao-song. Statistical Properties of Overlapping Word Segmentation Ambiguities in Domain-specific Chinese Corpora[J]. Journal of Chinese Information Processing, 2008, 22(4): 10-18.
Statistical Properties of Overlapping Word Segmentation Ambiguities in Domain-specific Chinese Corpora
QIAO Wei, SUN Mao-song
Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract: Overlapping ambiguity is a major type of ambiguity in Chinese word segmentation. Existing word segmentation systems still resolve it unsatisfactorily, especially on domain-specific texts. Although overlapping ambiguities in general-purpose corpora have been described statistically in considerable detail, comparable observations for domain-specific corpora have not been reported in the literature. Using a medium-sized general-purpose Chinese wordlist, a general-purpose corpus of over 900 million characters, and two domain-specific corpora totaling 140 million characters and covering 55 domains, this paper studies the statistical properties of high-frequency overlapping ambiguities from several perspectives: overlapping ambiguity strings extracted from the general corpus are examined in the domain corpora, and vice versa. The findings are expected to benefit word segmentation disambiguation, particularly for domain-specific texts.
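For readers unfamiliar with the term, an overlapping ambiguity arises when a string ABC can be cut as AB/C or A/BC because both AB and BC are wordlist entries. The following minimal Python sketch illustrates this with a toy wordlist. It is not the authors' extraction procedure: the function name `find_overlapping_ambiguities`, the `max_word_len` parameter, and the three-word wordlist are assumptions made for illustration, and the sketch reports only pairwise overlaps rather than the maximal overlapping ambiguity strings usually counted in corpus studies.

```python
def find_overlapping_ambiguities(text, wordlist, max_word_len=4):
    """Return sorted (start, end, substring) spans where two wordlist entries overlap.

    Hypothetical helper for illustration only; max_word_len is an assumed bound.
    """
    # Collect every (start, end) span of the text that is a multi-character word.
    spans = [
        (i, j)
        for i in range(len(text))
        for j in range(i + 2, min(len(text), i + max_word_len) + 1)
        if text[i:j] in wordlist
    ]
    found = set()
    for s1, e1 in spans:
        for s2, e2 in spans:
            # Overlap without containment: the second word starts inside
            # the first word and ends after it.
            if s1 < s2 < e1 < e2:
                found.add((s1, e2, text[s1:e2]))
    return sorted(found)


# Toy wordlist (an assumption, not the wordlist used in the paper).
toy_wordlist = {"结合", "合成", "分子"}
print(find_overlapping_ambiguities("结合成分子", toy_wordlist))
```

Here 结合 and 合成 share the character 合, so 结合成 is reported as an ambiguous span, while 分子 overlaps nothing and is left out.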