LIU Youqiang 1, LI Bin 1,2, XI Ning 1, CHEN Jiajun 1. A Bilingual Corpus Based Approach to Chinese Abbreviation Extraction [J]. Journal of Chinese Information Processing, 2012, 26(2): 69-75.
A Bilingual Corpus Based Approach to Chinese Abbreviation Extraction
LIU Youqiang 1, LI Bin 1,2, XI Ning 1, CHEN Jiajun 1
1. State Key Laboratory for Novel Software Technology at Nanjing University, Nanjing, Jiangsu 210093, China; 2. Research Center of Language and Informatics, Nanjing, Jiangsu 210097, China
Abstract: Chinese abbreviations are widely used in modern Chinese texts, and research on them is important for Chinese information processing. In this paper, we propose an approach to extracting Chinese abbreviations from a Chinese-English parallel corpus. First, we generate word alignments for the corpus and extract Chinese-English phrase pairs consistent with the alignments. Then, we separate high-quality phrase pairs from low-quality ones with an SVM classifier. Finally, we extract Chinese abbreviation and full-form phrase pairs from the high-quality group using their corresponding English translations and some rules. The experiments show that our approach extracts abbreviations with high accuracy and could be an effective way to obtain Chinese abbreviation and full-form phrase pairs.
Key words: abbreviation; parallel corpus; phrase extraction; classification
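The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep` placeholder standing in for the SVM classifier, and the character-subsequence rule are all assumptions made for illustration, since the abstract does not specify the features or the rules used. The idea shown is that Chinese phrases aligned to the same English phrase are candidates for an abbreviation and full-form pair.

```python
from collections import defaultdict


def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return (source phrase, target phrase) pairs consistent with the alignment.

    A pair is consistent when it contains at least one alignment link and no
    target word inside the span is linked to a source word outside the span.
    Only the minimal target span is produced (a simplification).
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject if a target word in [j1, j2] links outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((tuple(src[i1:i2 + 1]), tuple(tgt[j1:j2 + 1])))
    return pairs


def is_subsequence(short, full):
    """True if the characters of `short` occur in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def abbreviation_candidates(phrase_pairs, keep=lambda zh, en: True):
    """Pair up Chinese phrases that share an English translation.

    `keep` is a hypothetical stand-in for the quality classifier; the
    subsequence test is an assumed rule, not one taken from the paper.
    """
    by_english = defaultdict(set)
    for zh, en in phrase_pairs:
        if keep(zh, en):
            by_english[" ".join(en).lower()].add("".join(zh))
    candidates = set()
    for chinese_phrases in by_english.values():
        for short in chinese_phrases:
            for full in chinese_phrases:
                if len(short) < len(full) and is_subsequence(short, full):
                    candidates.add((short, full))
    return candidates


# Toy example: two aligned sentence pairs with hand-written alignments.
pairs = extract_phrase_pairs(
    ["北京", "大学", "成立"],
    ["Peking", "University", "was", "founded"],
    {(0, 0), (1, 1), (2, 3)},
)
pairs |= extract_phrase_pairs(
    ["北大", "成立"],
    ["Peking", "University", "founded"],
    {(0, 0), (0, 1), (1, 2)},
)
print(abbreviation_candidates(pairs))  # {('北大', '北京大学')}
```

In a fuller version, `keep` would be replaced by a trained classifier over phrase-pair features, and the rule set would likely need to handle abbreviations that are not strict character subsequences of their full forms.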