
LI Li, LIU Zhiyuan, SUN Maosong

Journal of Chinese Information Processing, 2013, Vol. 27, Issue 6: 151-158.



Automatically Extracting Phrase-level Paraphrases from Chinese-English Parallel Patents

  • LI Li, LIU Zhiyuan, SUN Maosong




Automatically extracting phrase-level paraphrases is an important research task in natural language processing (NLP), with applications in information retrieval, question answering and document classification. Technical patents, as an important carrier of human knowledge and technology, contain abundant information, so extracting phrase-level paraphrases from Chinese-English parallel patents can benefit technology-related NLP tasks. In this paper, we extract phrase-level paraphrases from Chinese-English parallel patents automatically with a method based on statistical machine translation, and use chunk parsing to verify the extracted paraphrases. To handle errors caused by translation ambiguity and poor word alignment, we further re-rank the extracted paraphrases by distributional similarity. In experiments on the Top-500 results, the method based on statistical machine translation achieves a precision of 43.20% on Chinese patents and 43.60% on English patents. After verification with chunk parsing, the precisions rise to 75.50% and 52.40%, respectively. Re-ranking based on distributional similarity also improves performance significantly.
Key words: phrase-level paraphrase; statistical machine translation; chunk parsing; distributional similarity
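
The abstract does not give formulas for the extraction and re-ranking steps, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the extraction step follows the common pivot formulation of Bannard and Callison-Burch, p(e2|e1) = Σ_f p(f|e1) · p(e2|f), where f ranges over phrases in the other language, and that re-ranking uses cosine similarity between context-count vectors. The function names (`pivot_paraphrases`, `cosine`, `rerank`) and all probabilities and counts are hypothetical toy values chosen for illustration. Chunk-based verification is not shown.

```python
from collections import defaultdict
from math import sqrt


def pivot_paraphrases(p_f_given_e, p_e_given_f):
    """Score candidate paraphrases e2 of e1 by summing over shared pivot phrases f.

    Assumed formulation: p(e2 | e1) = sum_f p(f | e1) * p(e2 | f).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_f_given_e.items():
        for f, p_fe in pivots.items():
            for e2, p_ef in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not its own paraphrase
                    scores[e1][e2] += p_fe * p_ef
    return {e1: dict(cands) for e1, cands in scores.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(e1, candidates, context_vectors):
    """Order candidates by distributional similarity to e1, most similar first."""
    src = context_vectors.get(e1, {})
    return sorted(
        candidates,
        key=lambda e2: cosine(src, context_vectors.get(e2, {})),
        reverse=True,
    )


# Toy phrase-table probabilities (invented for illustration only).
p_f_given_e = {"data storage unit": {"数据存储单元": 0.7, "存储器": 0.3}}
p_e_given_f = {
    "数据存储单元": {"data storage unit": 0.6, "data memory unit": 0.4},
    "存储器": {"memory": 0.8, "data storage unit": 0.2},
}

# Toy context-word counts for each English phrase (also invented).
context_vectors = {
    "data storage unit": {"the": 3, "stores": 2, "coupled": 1},
    "data memory unit": {"the": 2, "stores": 2, "coupled": 1},
    "memory": {"a": 4, "shared": 1},
}

scores = pivot_paraphrases(p_f_given_e, p_e_given_f)
candidates = list(scores["data storage unit"])
ranked = rerank("data storage unit", candidates, context_vectors)
print(scores["data storage unit"])
print(ranked)
```

In this toy setup, a candidate reached through a pivot phrase can still be a poor paraphrase if the pivot is ambiguous or misaligned; comparing monolingual contexts is one way to push such candidates down the ranking.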


LI Li, LIU Zhiyuan, SUN Maosong. Automatically Extracting Phrase-level Paraphrases from Chinese-English Parallel Patents (original Chinese title: 基于中英平行专利语料的短语复述自动抽取研究). Journal of Chinese Information Processing. 2013, 27(6): 151-158










