Automatically Extracting Phrase-level Paraphrases from Chinese-English Parallel Patents

LI Li, LIU Zhiyuan, SUN Maosong

Journal of Chinese Information Processing ›› 2013, Vol. 27 ›› Issue (6): 151-158.
Survey

Abstract

Automatic extraction of phrase-level paraphrases is an important research topic in natural language processing (NLP) and has been widely applied to tasks such as information retrieval, question answering, and document classification. Patents, as carriers of human knowledge and technology, are rich in content, so extracting phrase-level paraphrases from Chinese-English parallel patents can benefit NLP tasks on technical topics. In this paper, we extract phrase-level paraphrases from Chinese-English parallel patents with a method based on statistical machine translation, and filter the extracted results with chunk parsing. In addition, to handle extraction errors caused by word alignment errors and translation ambiguity, we re-rank the extracted paraphrases by distributional similarity. Experiments show that the method based on statistical machine translation achieves a precision of 43.20% on Chinese and 43.60% on English for the Top-500 results; after filtering with chunk parsing, precision rises to 75.50% and 52.40%, respectively. Re-ranking based on distributional similarity also effectively improves the extraction results.
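The abstract does not spell out the scoring formulas, so the following is only a minimal sketch of how extraction through statistical machine translation followed by distributional re-ranking is commonly set up. It assumes the standard pivot formulation, p(e2 | e1) = Σ_f p(e2 | f) · p(f | e1), and cosine similarity over context-count vectors. All function names, the toy probabilities, and the pinyin pivot labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def pivot_paraphrases(p_f_given_e, p_e_given_f):
    """Score paraphrase candidates by pivoting through foreign phrases.

    p_f_given_e[e][f] holds p(f | e); p_e_given_f[f][e] holds p(e | f).
    Returns scores[e1][e2] = sum over f of p(e2 | f) * p(f | e1), for e2 != e1.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_f_given_e.items():
        for f, p_f in pivots.items():
            for e2, p_e2 in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not its own paraphrase
                    scores[e1][e2] += p_e2 * p_f
    return {e1: dict(cands) for e1, cands in scores.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(phrase, candidates, context_vectors):
    """Order candidates by context similarity to `phrase`, most similar first."""
    src = context_vectors.get(phrase, {})
    return sorted(
        candidates,
        key=lambda c: cosine(src, context_vectors.get(c, {})),
        reverse=True,
    )


# Toy phrase-table entries (hypothetical values; pivots written in pinyin).
p_f_given_e = {"in order to": {"weile": 0.8, "yibian": 0.2}}
p_e_given_f = {
    "weile": {"in order to": 0.6, "so as to": 0.3, "for": 0.1},
    "yibian": {"in order to": 0.5, "so as to": 0.5},
}
scores = pivot_paraphrases(p_f_given_e, p_e_given_f)

# Toy context-count vectors (hypothetical) used for re-ranking.
context_vectors = {
    "in order to": {"achieve": 2.0, "obtain": 1.0},
    "so as to": {"achieve": 1.0, "obtain": 1.0},
    "for": {"example": 3.0},
}
ranked = rerank("in order to", list(scores["in order to"]), context_vectors)
```

In this toy setup, a candidate reached through several pivots accumulates probability mass, while a candidate that shares no contexts with the source phrase drops to the bottom after re-ranking. The chunk-parsing filter described in the abstract is not sketched here because the source gives no detail on its criteria.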

Key words

phrase-level paraphrase / statistical machine translation / chunk parsing / distributional similarity

Cite this article

LI Li, LIU Zhiyuan, SUN Maosong. Automatically Extracting Phrase-level Paraphrases from Chinese-English Parallel Patents. Journal of Chinese Information Processing. 2013, 27(6): 151-158

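Funding

National Natural Science Foundation of China (61133012); National High Technology Research and Development Program of China (863 Program) (2012AA011102)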