Abstract
Translation equivalents of named entities play a critical role in cross-language information processing. Traditional extraction methods usually rely on parallel or comparable corpora and are therefore limited by the quality and size of those resources. For Japanese-Chinese translation, bilingual resources are relatively scarce. Named entities written in Chinese characters are usually translated with a Chinese Hanzi to Japanese Kanji mapping table, while Japanese named entities written purely in kana are usually handled by a statistical machine translation (SMT) model, which again depends on the quality and size of the parallel corpus and yields low precision. To address this problem, this paper proposes a method, based on monolingual corpora, for automatically extracting Japanese-Chinese translation pairs of personal names written in Japanese kana. The method first uses conditional random field (CRF) models to extract Japanese and Chinese personal names from Japanese and Chinese monolingual corpora, respectively. It then acquires a Japanese-Chinese transliteration rule base for personal names automatically through instance-based inductive learning, and iteratively rebuilds the rule base through feedback learning. The rule base is used to compute the similarity between Japanese and Chinese personal names, and a given threshold decides whether a pair is a translation equivalent. Experimental results show that the proposed method is simple and efficient: it achieves high precision while removing the dependence of traditional methods on bilingual resources.
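The final matching step described in the abstract (scoring candidate name pairs with a transliteration rule base and accepting those above a threshold) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the rule table `RULES`, the functions `similarity` and `extract_pairs`, the position-wise alignment, the hit-ratio score, and the threshold of 0.8 are all assumptions made for the example, since the abstract does not give the paper's actual similarity formula.

```python
# Hypothetical toy rule base: romanized kana unit -> pinyin syllables it may map to.
# These entries are illustrative assumptions, not rules taken from the paper.
RULES = {
    "ma": {"ma"},
    "ri": {"li"},
    "a": {"a", "ya"},
    "to": {"tuo"},
    "mu": {"mu"},
}


def similarity(kana_units, pinyin_syllables, rules=RULES):
    """Fraction of aligned positions where a rule licenses the pinyin syllable.

    Assumes a one-to-one, left-to-right alignment. The score is divided by
    the longer of the two lengths, so unmatched extra units count as misses.
    """
    longest = max(len(kana_units), len(pinyin_syllables))
    if longest == 0:
        return 0.0
    hits = sum(
        1
        for kana, pinyin in zip(kana_units, pinyin_syllables)
        if pinyin in rules.get(kana, set())
    )
    return hits / longest


def extract_pairs(ja_names, zh_names, threshold=0.8, rules=RULES):
    """Return (ja, zh, score) for every cross-lingual pair scoring >= threshold."""
    pairs = []
    for ja_label, ja_units in ja_names.items():
        for zh_label, zh_syllables in zh_names.items():
            score = similarity(ja_units, zh_syllables, rules)
            if score >= threshold:
                pairs.append((ja_label, zh_label, score))
    return pairs


# Names as they might come out of the monolingual extraction step,
# already romanized (kana) or converted to pinyin (Chinese).
ja_names = {"マリア": ["ma", "ri", "a"], "トム": ["to", "mu"]}
zh_names = {"玛丽亚": ["ma", "li", "ya"], "汤姆": ["tang", "mu"]}
print(extract_pairs(ja_names, zh_names))
```

With these toy rules only the pair マリア / 玛丽亚 passes the threshold. The pair トム / 汤姆 scores 0.5 because the table has no rule mapping `to` to `tang`, which is the kind of gap that an iterative, feedback-driven rebuild of the rule base would be expected to close.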
Key words
machine translation /
named entities /
Japanese kana /
inductive learning method /
transliteration
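Funding
National Natural Science Foundation of China (61370130, 61473294); Fundamental Research Funds for the Central Universities (2015JBM033); International Science and Technology Cooperation Program of China (2014DFA11350)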