Abstract
Word segmentation (WS) is a fundamental task in Chinese information processing. To overcome the difficulties that traditional methods face with texts from specialized domains, this paper proposes a new segmentation method. It requires no lexicon or training corpus; instead, the user takes part in the segmentation process and adds language knowledge to the system, so that it can adapt to different texts and different WS standards. Using an improved suffix array algorithm, the system repeatedly extracts candidate words from the text and passes them to the user for filtering, and the resulting lexicon is then used to segment the text. Experiments on four different corpora show that, without any manual filtering, the segmentation F-score reaches about 72%, and that a small amount of human-computer interaction raises the F-score by more than 12%. As the user's workload increases, the system can further improve its segmentation results.
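The pipeline outlined in the abstract (extract repeated strings with a suffix array, let the user filter them, segment with the resulting lexicon, score with the F-measure) can be illustrated with a minimal sketch. The abstract does not describe the improved suffix array algorithm or the matching strategy, so everything here is an assumption: a naive suffix array, forward maximum matching, a hard-coded set standing in for the user's choices, and hypothetical function names throughout.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in sorted order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def extract_candidates(text, min_len=2, max_len=4, min_freq=2):
    """Map each substring of length min_len..max_len to its frequency,
    keeping only those that occur at least min_freq times."""
    sa = build_suffix_array(text)
    candidates = {}
    for k in range(min_len, max_len + 1):
        run, count = None, 0
        # Suffixes sharing the same length-k prefix are adjacent in sa.
        for start in sa:
            prefix = text[start:start + k]
            if len(prefix) < k:
                continue  # suffix too short to contribute a k-gram
            if prefix == run:
                count += 1
            else:
                if run is not None and count >= min_freq:
                    candidates[run] = count
                run, count = prefix, 1
        if run is not None and count >= min_freq:
            candidates[run] = count
    return candidates


def segment(text, lexicon, max_len=4):
    """Forward maximum matching (assumed strategy, not from the paper)."""
    words, i = [], 0
    while i < len(text):
        for k in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + k]
            if k == 1 or piece in lexicon:  # fall back to a single character
                words.append(piece)
                i += k
                break
    return words


def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def f_score(predicted, gold):
    """F = 2PR / (P + R), computed over matching word spans."""
    p_spans, g_spans = word_spans(predicted), word_spans(gold)
    correct = len(p_spans & g_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    return 2 * precision * recall / (precision + recall)


text = "中文分词很重要中文分词很困难"
candidates = extract_candidates(text)
# Simulated user filtering: in the described system a person reviews the
# candidate list; here the accepted words are hard-coded for illustration.
lexicon = {w for w in candidates if w in {"中文", "分词"}}
predicted = segment(text, lexicon)
gold = ["中文", "分词", "很", "重要", "中文", "分词", "很", "困难"]
print(predicted, round(f_score(predicted, gold), 3))
```

In this toy run, "重要" and "困难" occur only once, so they never become candidates and are split into single characters. That is the kind of gap the abstract says further user effort can close.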
Key words
computer application /
Chinese information processing /
word segmentation /
unknown word recognition /
unknown text /
human-computer interaction
Funding
Supported by the National Natural Science Foundation of China (60275020)