Abstract
Chinese input is one of the key challenges in Chinese information processing. With the rapid increase in the number of Chinese Internet users, the efficiency of Chinese input methods has become more and more important. Based on observations of long-distance (long-term) dependencies between words in sentences, we extract collocations from a large-scale corpus to capture long-distance features and use them to build a collocation-based pinyin input system. The same idea is further introduced into the user model (the personalization module) of our pinyin system to help users input Chinese more efficiently. Experimental results show that the collocation-based methods proposed in this paper help improve the accuracy of the input method.
Key words
computer application /
Chinese information processing /
Chinese input method /
statistical language model /
collocations /
long-term dependence /
user model