Word2vec Based Word Alignment Corpus for the Greater China Region

WANG Mingwen, XU Xiongfei, XU Fan, LI Maoxi

Journal of Chinese Information Processing, 2015, Vol. 29, Issue (5): 76-84.
Language Resource Construction

Abstract

We address the linguistic phenomenon that Mainland China, Hong Kong, and Taiwan, collectively referred to as the Greater China Region (GCR), often use different words to express the same meaning. First, we automatically crawl 3.2 million GCR parallel sentence pairs from Wikipedia and from news websites available in both simplified and traditional Chinese, and manually annotate a GCR parallel word alignment corpus of 10,000 word pairs with an annotation agreement above 95%. We then propose a two-phase GCR word alignment model that obtains vector representations of GCR words with word2vec and combines an effective cosine similarity measure with post-processing techniques. Experimental results on the two word alignment corpora of different genres show that the proposed model significantly outperforms the GIZA++ and HMM-based baselines in F1 score. Furthermore, we apply the model to Wikipedia to generate 90,029 GCR word triples with an accuracy of 82.66%.
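The abstract states that the model represents GCR words as word2vec vectors and scores candidate alignments with cosine similarity before post-processing. The article page gives no code, so the following is only a minimal sketch of that scoring step, assuming pre-trained vectors are available as plain word-to-vector mappings and that simplified and traditional words share one embedding space; the names vectors_cn and vectors_tw and the 0.5 threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def align_word(word, vectors_src, vectors_tgt, threshold=0.5):
    """Return the target-variant word whose vector is most similar to the
    source word's vector, or None if no candidate clears the threshold."""
    if word not in vectors_src:
        return None
    src_vec = vectors_src[word]
    best_word, best_sim = None, threshold
    for cand, cand_vec in vectors_tgt.items():
        sim = cosine(src_vec, cand_vec)
        if sim > best_sim:
            best_word, best_sim = cand, sim
    return best_word

# Toy usage with made-up 3-dimensional vectors; real vectors would be
# learned by word2vec from the crawled GCR parallel sentences.
vectors_cn = {"软件": np.array([0.9, 0.1, 0.0])}
vectors_tw = {"軟體": np.array([0.8, 0.2, 0.1]),
              "硬體": np.array([0.1, 0.9, 0.0])}
print(align_word("软件", vectors_cn, vectors_tw))  # expected: 軟體
```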

Keywords

the Greater China Region / word alignment / the longest common subsequence / word2vec
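The keyword list names the longest common subsequence, but the abstract does not spell out how it is used; a plausible reading is that LCS pre-aligns a simplified/traditional sentence pair character by character so that only the mismatching spans remain as word alignment candidates. The sketch below shows the standard dynamic-programming algorithm under that assumption.

```python
def lcs(a, b):
    """Length of the longest common subsequence of sequences a and b,
    computed by the standard O(len(a)*len(b)) dynamic program."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

# Toy usage on a sentence pair after converting the traditional side to
# simplified characters; the matched characters anchor the pair and the
# unmatched spans (软件 vs 软体) become word alignment candidates.
print(lcs("我安装了软件", "我安装了软体"))  # 5 shared characters
```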

Cite this article

WANG Mingwen, XU Xiongfei, XU Fan, LI Maoxi. Word2vec Based Word Alignment Corpus for the Greater China Region. Journal of Chinese Information Processing, 2015, 29(5): 76-84.


Funding

National Natural Science Foundation of China (61462045, 61402208, 61462044); the 12th Five-Year Plan of the State Language Commission (YB125-99); Natural Science Foundation of Jiangxi Province (20132BAB201030, 20151BAB207027, 20151BAB207025)