LI Bin, CHEN Xiao-he. A Human-Computer Interaction Word Segmentation Method Adapting to Chinese Unknown Texts[J]. Journal of Chinese Information Processing (中文信息学报), 2007, 21(3): 92-98.
A Human-Computer Interaction Word Segmentation Method Adapting to Chinese Unknown Texts
LI Bin (李斌), CHEN Xiao-he (陈小荷)
School of Chinese Language and Literature, Nanjing Normal University, Nanjing, Jiangsu 210097, China
Abstract: Word segmentation (WS) is a fundamental task in Chinese information processing. To overcome the difficulties that traditional methods face when processing texts in restricted domains, a novel method is proposed. It requires no lexicon or training corpus and can adapt to various texts and different WS standards. It lets the user take part in the WS procedure and add linguistic knowledge to the system. Using an optimized suffix array algorithm, word candidates are recursively extracted from the text and then judged and edited by the user. The result is a lexicon of the text, which is then applied to segment that text. Experiments on 4 different texts show that, without the user's judgement, the F-score of the system reaches as much as 72%, and that it can be improved by 12% with a certain amount of work by the user. As the user's workload increases, the system achieves better results.
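The pipeline outlined in the abstract (extract repeated strings as word candidates, let the user vet them, then segment with the resulting lexicon) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `extract_candidates` and `segment` and the parameters `min_len` and `max_len` are hypothetical. A naive sorted list of truncated suffixes stands in for the paper's optimized suffix array, the extraction is a single pass rather than recursive, and forward maximum matching is assumed for the final step because the abstract does not say how the lexicon is applied.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def extract_candidates(text, min_len=2, max_len=4):
    """Return substrings of min_len..max_len characters that occur at least twice.

    Naive stand-in for a suffix array: suffixes are truncated to max_len
    characters and sorted, so every repeated substring shows up as a
    common prefix of two adjacent entries.
    """
    suffixes = sorted(text[i:i + max_len] for i in range(len(text)))
    candidates = set()
    for prev, cur in zip(suffixes, suffixes[1:]):
        lcp = common_prefix_len(prev, cur)
        for n in range(min_len, lcp + 1):
            candidates.add(cur[:n])
    return candidates


def segment(text, lexicon, max_len=4):
    """Forward maximum matching (an assumption, not confirmed by the abstract).

    At each position take the longest lexicon entry; fall back to a
    single character when nothing matches.
    """
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                words.append(piece)
                i += n
                break
    return words


text = "中文分词很难中文分词很重要"
candidates = extract_candidates(text)

# Simulated user judgement: keep two candidates and add one word by hand.
lexicon = {"中文", "分词"} & candidates | {"重要"}
print("/".join(segment(text, lexicon)))
```

In this toy run the candidate set also contains strings such as "文分" and "词很" that are not words, which is the kind of noise the user's judging and editing step is meant to filter out.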