Automatic Extraction of Multi-Word Domain Terms in Uyghur Texts
TIAN Shengwei1, ZHONG Jun2, YU Long3
1. School of Software, Xinjiang University, Urumqi, Xinjiang 830008, China; 2. Information Science and Engineering Technology Institute, Xinjiang University, Urumqi, Xinjiang 830046, China; 3. Net Center, Xinjiang University, Urumqi, Xinjiang 830046, China
Abstract: Multi-word domain term extraction is an important task in natural language processing. Drawing on the linguistic features of Uyghur, a method for extracting Uyghur multi-word domain terms based on rules and statistics is proposed. The method is divided into four phases: ① corpus pre-processing, including stop-word filtering and part-of-speech (POS) tagging; ② obtaining N-gram substrings as term candidates using POS information, and calculating their internal associative strength with modified mutual information and the log-likelihood ratio; ③ expanding the set of term candidates using the relative frequency difference; ④ deciding the final terms by C-value. The experimental results show that the proposed method is effective, achieving 85.08% precision and 73.19% recall in Uyghur multi-word domain term extraction.
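The statistical measures named in phases ② and ④ can be illustrated with a minimal sketch. It rests on several assumptions: the abstract does not define the "modified" mutual information, so the code uses standard pointwise mutual information; the log-likelihood ratio is the usual 2x2 contingency-table form; and C-value follows the commonly cited formula in which a candidate's frequency is reduced by the mean frequency of the longer candidates that contain it. All function names are hypothetical and the code is not the authors' implementation.

```python
import math


def pmi(c_xy, c_x, c_y, n):
    """Standard pointwise mutual information of a word pair (assumption:
    the paper's 'modified' variant is not specified in the abstract)."""
    return math.log2((c_xy * n) / (c_x * c_y))


def llr(c_xy, c_x, c_y, n):
    """Log-likelihood ratio over the 2x2 contingency table of a word pair."""
    observed = [c_xy, c_x - c_xy, c_y - c_xy, n - c_x - c_y + c_xy]
    rows = [c_x, n - c_x]
    cols = [c_y, n - c_y]
    expected = [rows[0] * cols[0] / n, rows[0] * cols[1] / n,
                rows[1] * cols[0] / n, rows[1] * cols[1] / n]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter
               for i in range(len(longer) - k + 1))


def c_value(freq):
    """C-value for each candidate; `freq` maps word tuples to frequencies."""
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of longer candidates in which `a` is nested.
        nested_in = [f_b for b, f_b in freq.items()
                     if len(b) > len(a) and contains(b, a)]
        adjusted = f_a - sum(nested_in) / len(nested_in) if nested_in else f_a
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Toy counts, invented purely for illustration.
candidates = {("w1", "w2"): 5, ("w1", "w2", "w3"): 2}
scores = c_value(candidates)
```

In this toy input the two-word candidate is nested in the three-word one, so its frequency is discounted before scaling by the logarithm of its length. A real pipeline would take the counts from the POS-filtered N-grams produced in phase ②.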