Research on Large-scale Sino-Tibetan Bilingual Corpus Construction for Natural Language Processing
Tse ring rgyal
Computer Department of Qinghai Normal University; Qinghai Normal University Tibetan Information Processing Centre, Provincial Key Laboratory of the Ministry of Education; Qinghai Provincial Tibetan Information Research Centre, Xining, Qinghai 810008, China
Abstract: The construction of bilingual corpora and research on their automatic alignment are of vital importance for the development of computational linguistics. So far, various types of Chinese-English bilingual corpora, including substantial sentence-aligned corpora for machine translation (MT), have been developed both in China and abroad. To begin MT research involving minority languages with state-of-the-art technology, research on automatic alignment between Chinese and Tibetan bi-texts at the discourse, paragraph, and sentence levels is necessary. This paper introduces a project on Sino-Tibetan bilingual corpus alignment, Chinese-Tibetan bilingual dictionary extraction, and the key technologies for corpus collection, storage, and retrieval. The project has developed technologies for Tibetan encoding identification and conversion, Tibetan corpus construction, Sino-Tibetan bilingual dictionary extraction, and Sino-Tibetan sentence and word alignment, ultimately producing a large-scale aligned Sino-Tibetan bilingual corpus for Chinese-Tibetan machine translation.
Key words: Chinese-Tibetan machine translation; Chinese-Tibetan bilingual corpus; coding; alignment technology