Multi-translation Based Chinese Paraphrase: Evaluation Metric and Corpus

RUAN Chong, SHI Wenxian, LI Yanhao, WENG Yijia, HU Junfeng

Journal of Chinese Information Processing, 2018, Vol. 32, Issue (12): 67-73.
Language Resource Construction
Abstract

Paraphrase corpora are the foundation of research on paraphrase phenomena, yet Chinese paraphrase corpora are scarce in academia. In this paper, we build on multiple Chinese translations of the novel Jane Eyre and obtain, through sentence alignment, a parallel paraphrase corpus on the order of 50 000 sentences. Using unsupervised clause alignment and word alignment algorithms, we mine more than 9 000 pairs of lexical paraphrase knowledge from the corpus. We also reimplement and improve the machine translation evaluation metric Meteor so that it is better suited to evaluating Chinese paraphrase sentences, and we construct a Chinese sentence paraphrase evaluation dataset for comparing different paraphrase knowledge sources and evaluation metrics. Experiments show that, in a closed test, the lexical paraphrase knowledge mined by our algorithm is no worse than that of Tongyici Cilin (《同义词词林》).

Key words

paraphrase knowledge mining / paraphrase evaluation metric / paraphrase corpus construction

Cite this article

RUAN Chong, SHI Wenxian, LI Yanhao, WENG Yijia, HU Junfeng. Multi-translation Based Chinese Paraphrase: Evaluation Metric and Corpus. Journal of Chinese Information Processing. 2018, 32(12): 67-73

Funding

National Natural Science Foundation of China (61472017)