Study on Japanese Word Segmentation and POS Tagging Based on Rules and Statistics

JIANG Shangpu, CHEN Qunxiu

Journal of Chinese Information Processing, 2010, Vol. 24, Issue 1: 117-123.
Review

Abstract

Word segmentation and part-of-speech (POS) tagging are the first step in Japanese natural language processing tasks such as machine translation with Japanese as the source language. This paper proposes a Japanese word segmentation and POS tagging method that combines rules and statistics. It uses a joint word segmentation and POS tagging algorithm based on a single perceptron as the basic framework and adds rule-based adjacency attributes of words as features. Experiments on a small test set show that the method achieves an F-score of 98.2% on word segmentation and 94.8% on joint word segmentation and POS tagging. The method has been successfully applied in a Japanese-Chinese machine translation system.
Key words: artificial intelligence; machine translation; Japanese-Chinese machine translation system; Japanese word segmentation; Japanese POS tagging; joint word segmentation
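The abstract reports two F-scores, one for segmentation alone and one for segmentation plus POS tagging. The sketch below shows how such scores are commonly computed, assuming the standard span-based definition (a predicted word counts as correct only if its character boundaries match the gold standard, and for the joint score its POS tag must match as well). The abstract does not state the exact evaluation procedure, so the function names, the example sentence, and the POS labels here are all hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (surface form, POS tag); tag names are illustrative


def to_spans(tokens: List[Token], with_pos: bool) -> Set[tuple]:
    """Convert a token sequence into a set of character-offset spans."""
    spans = set()
    start = 0
    for surface, pos in tokens:
        end = start + len(surface)
        # Include the POS tag in the key only for the joint score.
        spans.add((start, end, pos) if with_pos else (start, end))
        start = end
    return spans


def f_score(gold: List[Token], pred: List[Token], with_pos: bool = False) -> float:
    """Harmonic mean of precision and recall over matching spans."""
    gold_spans = to_spans(gold, with_pos)
    pred_spans = to_spans(pred, with_pos)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted analyses of the same sentence.
gold = [("日本語", "noun"), ("を", "particle"), ("話す", "verb")]
pred = [("日本", "noun"), ("語", "suffix"), ("を", "particle"), ("話す", "noun")]

print(f_score(gold, pred))                 # segmentation only
print(f_score(gold, pred, with_pos=True))  # segmentation + POS tagging
```

Because a word with the wrong boundaries can never have a correct joint label, the joint score is at most the segmentation score, which is consistent with the 98.2% and 94.8% figures reported above.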

Cite this article

JIANG Shangpu, CHEN Qunxiu. Study on Japanese Word Segmentation and POS Tagging Based on Rules and Statistics. Journal of Chinese Information Processing. 2010, 24(1): 117-123

Funding

Key project supported by the National 863 Program of China (2006AA010109)