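The first problem encountered in source-language analysis for Chinese-English MT is word recognition. The "word" in Chinese has no clear definition, and there are no sharp boundaries between morpheme and word, word and phrase, or phrase and sentence. Under the approach of segmenting first and parsing afterwards, segmentation runs into the difficulty that word-formation problems and syntactic problems are intertwined. The author argues that the character can be taken as the starting point of source-language syntactic analysis, so that the recognition of words and phrases proceeds simultaneously with syntactic analysis. This paper describes this view and its implementation, and uses the handling of separable words (离合词) as an example to illustrate the basic recognition method.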
Abstract
The first problem encountered in source-language analysis in a Chinese-English MT system is Chinese sentence tokenization, since written Chinese has no explicit word delimiter. Finding token boundaries in a character string is often intertwined with syntactic parsing, or even with semantic relations. This paper presents an approach that combines sentence tokenization with syntactic-semantic analysis. Instead of producing a tokenized word string before parsing, the tokenizing component is built into the parser, i.e., syntactic and semantic information can be used to recognize words when necessary during parsing. The parser is supported by a dictionary that describes individual usage and by a set of common rules.
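The idea of letting the parser decide word boundaries can be illustrated with a minimal sketch. It is not the system described in the paper: the lexicon, the two grammar rules, the category names, and the function `parse` are all hypothetical, and a plain CKY-style chart over characters stands in for whatever parsing strategy the authors actually use. Dictionary entries of any length are entered directly into a chart indexed by character positions, so a candidate word survives only if it takes part in a complete analysis.

```python
from collections import defaultdict

# Hypothetical toy lexicon: character string -> set of categories.
LEXICON = {
    "他": {"NP"},      # he
    "研究": {"V"},     # to study
    "研究生": {"NP"},  # graduate student
    "生命": {"NP"},    # life
    "命": {"NP"},      # fate
}

# Hypothetical binary grammar rules: (left category, right category) -> parent.
RULES = {
    ("NP", "VP"): "S",
    ("V", "NP"): "VP",
}


def parse(chars, goal="S"):
    """Return every word sequence that yields `goal` over the whole string.

    The chart is indexed by character positions, so word recognition and
    syntactic analysis happen in the same pass.
    """
    n = len(chars)
    # chart[i, j][category] = set of word tuples covering chars[i:j]
    chart = defaultdict(lambda: defaultdict(set))

    # Enter every dictionary word found at every character span.
    for i in range(n):
        for j in range(i + 1, n + 1):
            for cat in LEXICON.get(chars[i:j], ()):
                chart[i, j][cat].add((chars[i:j],))

    # Combine adjacent spans bottom-up, shortest spans first.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for lcat, lwords in list(chart[i, k].items()):
                    for rcat, rwords in list(chart[k, j].items()):
                        parent = RULES.get((lcat, rcat))
                        if parent is None:
                            continue
                        for lw in lwords:
                            for rw in rwords:
                                chart[i, j][parent].add(lw + rw)

    return sorted(chart[0, n][goal])


print(parse("他研究生命"))
```

With this toy lexicon, a forward longest-match segmenter would take 研究生 after 他 and leave 命 stranded. In the chart, the sequence 他 / 研究生 / 命 never combines into a sentence under the two rules, so only 他 / 研究 / 生命 is returned.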
Keywords
Machine translation /
Automatic analysis of Chinese /
Automatic recognition of Chinese words
Key words
Machine translation /
Chinese language parsing /
Chinese tokenization