Uyghur Chinese Sentence Alignment Based on Multi Features and Optimal Matching
Ni Yaoqun 1,2,3, Xu Hongbo 1, Cheng Xueqi 1
1. CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. Department of Language Engineering, University of Chinese Academy of Sciences, Beijing 100049, China; 3. Department of Language Engineering, University of Foreign Languages, Luoyang, Henan 471003, China
Abstract: The content of Uyghur news webpages is usually only partially comparable with that of their Chinese counterparts. Uyghur sentences may appear in a different order in the Chinese text, or be missing from it altogether, which makes it difficult to mine parallel sentences (i.e., sentence beads) from bilingual news. First, to improve the word matching rate on such text, person and location names in the Chinese text are extracted and translated into Uyghur to strengthen the bilingual mapping. Then we scan the Chinese sentences for the translations of Uyghur words and calculate the translation rate via string matching, which avoids errors introduced by Chinese word segmentation. The final similarity of a sentence pair is calculated by combining the word translation rate with features based on numbers, punctuation, and sentence length. The similarities of all bilingual sentence pairs form a weight matrix. We use a greedy algorithm and a maximum weight matching algorithm on the bipartite graph to find the most probable parallel sentence pairs. Our method achieves an accuracy of 95.67% in sentence alignment.
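The two core steps summarized in the abstract, the string-matching translation rate and the selection of sentence pairs from the weight matrix, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy lexicon with placeholder Uyghur tokens, the `threshold` parameter, and the brute-force search standing in for a real maximum weight matching algorithm are all assumptions made for illustration. The weight matrix is assumed to already hold the combined similarity of each sentence pair.

```python
from itertools import permutations


def translation_rate(uyghur_words, chinese_sentence, lexicon):
    """Fraction of Uyghur words with at least one lexicon translation that
    occurs as a substring of the unsegmented Chinese sentence."""
    if not uyghur_words:
        return 0.0
    hits = sum(
        1
        for word in uyghur_words
        if any(t in chinese_sentence for t in lexicon.get(word, ()))
    )
    return hits / len(uyghur_words)


def greedy_alignment(weights, threshold=0.0):
    """Repeatedly take the largest remaining weight whose row (Uyghur
    sentence) and column (Chinese sentence) are both still unmatched."""
    cells = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row)),
        key=lambda cell: -cell[0],
    )
    used_rows, used_cols, pairs = set(), set(), []
    for w, i, j in cells:
        if w <= threshold:
            break  # every remaining cell is at or below the threshold
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            pairs.append((i, j))
    return sorted(pairs)


def max_weight_alignment(weights, threshold=0.0):
    """Exact maximum weight bipartite matching by exhaustive search.
    Only practical for very small matrices; a stand-in for a real
    matching algorithm. Pairs at or below the threshold stay unmatched."""
    if not weights or not weights[0]:
        return []
    n_rows, n_cols = len(weights), len(weights[0])
    transposed = n_rows > n_cols
    if transposed:  # make sure rows are the smaller side
        weights = [list(col) for col in zip(*weights)]
        n_rows, n_cols = n_cols, n_rows
    best_total, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        pairs = [(i, j) for i, j in enumerate(cols) if weights[i][j] > threshold]
        total = sum(weights[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    if transposed:
        best_pairs = [(j, i) for i, j in best_pairs]
    return sorted(best_pairs)


# Toy data: placeholder Uyghur tokens mapped to Chinese translations.
lexicon = {"u_beijing": ["北京"], "u_news": ["新闻"], "u_school": ["学校"]}
rate = translation_rate(["u_beijing", "u_news", "u_school"], "北京新闻报道", lexicon)

# Rows are Uyghur sentences, columns are Chinese sentences.
weights = [[0.9, 0.8], [0.8, 0.1]]
greedy_pairs = greedy_alignment(weights)
optimal_pairs = max_weight_alignment(weights)
```

On this toy matrix the greedy pass commits to the single best cell first and ends with a total weight of 1.0, while the exhaustive search finds the assignment with a total of 1.6. This shows why the two strategies can disagree on the same weight matrix.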