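Taking Uyghur and Kazakh, a pair of closely related languages, as an example, this paper supplements the original corpus with out-of-domain data when Kazakh data is limited and, after assimilation, improves the accuracy of language identification on short conversational texts. The paper analyzes the morphological characteristics of Uyghur and Kazakh, designs several features, and builds a maximum entropy classifier. On the test set, the classifier identifies Uyghur and Kazakh short conversational texts with an accuracy of 95.7%, whereas a CNN classifier reaches only 69.1%. The experimental results show that the system is also applicable to language identification of short conversational texts in other languages.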
Abstract
This paper aims to identify similar languages, such as Uyghur and Kazakh, in short conversational texts. To alleviate the severe data imbalance resulting from low-resource Kazakh, we apply a compensation strategy and an assimilation method that select appropriate out-of-domain data. We then construct a maximum entropy (MaxEnt) classifier based on morphological features to discriminate between the two languages, and we investigate the contribution of each feature. Experimental results suggest that the MaxEnt classifier effectively discriminates between Uyghur and Kazakh on the test set with an accuracy of 95.7%, and that it outperforms the champion of the VarDial’2016 DSL shared task on test sets B1 and B2 by 0.6% and 1.2%, respectively.
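The abstract does not list the individual morphological features or the training procedure, so the following is only a minimal sketch of the general technique: a two-class MaxEnt model (equivalent to binary logistic regression) trained on character n-gram and word-suffix features. The feature templates, the hyperparameters, the function names, and the toy strings are all assumptions made for illustration. The strings are synthetic tokens, not Uyghur or Kazakh text.

```python
import math
from collections import Counter


def extract_features(text, n=2, suffix_len=2):
    """Character n-grams and word-final suffixes.

    Hypothetical stand-ins for the paper's morphological features,
    which the abstract does not enumerate.
    """
    feats = Counter()
    for word in text.split():
        padded = f"^{word}$"  # mark word boundaries
        for i in range(len(padded) - n + 1):
            feats["ng=" + padded[i:i + n]] += 1
        feats["suf=" + word[-suffix_len:]] += 1
    return feats


def predict_proba(weights, feats):
    """P(label = 1 | features) under the two-class MaxEnt model."""
    score = sum(weights[f] * v for f, v in feats.items())
    score = max(-30.0, min(30.0, score))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-score))


def train_maxent(samples, labels, epochs=200, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-regularised log-likelihood."""
    data = [(extract_features(s), y) for s, y in zip(samples, labels)]
    weights = Counter()
    for _ in range(epochs):
        grad = Counter()
        for feats, y in data:
            p = predict_proba(weights, feats)
            for f, v in feats.items():
                grad[f] += (y - p) * v  # observed minus expected count
        for f, g in grad.items():
            weights[f] += lr * (g / len(data) - l2 * weights[f])
    return weights


def classify(weights, text):
    """Return 1 or 0 depending on which class is more probable."""
    return 1 if predict_proba(weights, extract_features(text)) >= 0.5 else 0


# Synthetic toy data: label 1 words end in "-a"/"-o", label 0 words in "-un".
train_texts = ["taka miko", "soka laka", "pika nako",
               "takun mikun", "sokun lakun", "pikun nakun"]
train_labels = [1, 1, 1, 0, 0, 0]
model = train_maxent(train_texts, train_labels)
```

Calling `classify(model, "raka boka")` and `classify(model, "rakun bokun")` applies the trained weights to unseen strings. In a setting like the one the abstract describes, the toy strings would be replaced by labeled short texts and the feature templates by the language-specific morphological features.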
Keywords
similar language identification /
out-of-domain data /
short conversational texts /
character-level morphological features
Key words
similar language identification /
out-of-domain data /
short conversational texts /
morphological features
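Funding
National Natural Science Foundation of China (11590771, 11590770-4, 11722437, 61650202, U1536117, 61671442, 11674352, 11504406, 61601453); National Key Research and Development Program of China (2016YFB0801203, 2016YFC0800503, 2017YFB1002803); Major Science and Technology Project of the Xinjiang Uygur Autonomous Region (2016A03007-1); Young Talents Program of the Institute of Acoustics, Chinese Academy of Sciences (QNYC201603)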