An Improving Method for Social Media Text Normalization

SONG Yajun, YU Zhonghua, CHEN Li, DING Gejian, LUO Qian

Journal of Chinese Information Processing, 2015, Vol. 29, Issue 5: 104-112.
Information Extraction and Text Mining


Abstract

The informal style of social media text challenges many natural language processing tools, including many keyword-based methods proposed for social media text. Normalizing social media text is therefore indispensable. Based on the assumption that lexical variants occur in similar contexts, we propose an improved graph-based social media text normalization method that introduces a word embedding model to better capture context similarity. Because the method is unsupervised and language-independent, it can be used to process large-scale social media text in various languages. Experimental results show that the proposed method outperforms previous methods, achieving the best F-score.

Key words

social media / text normalization / natural language processing / word embedding

Cite this article

SONG Yajun, YU Zhonghua, CHEN Li, DING Gejian, LUO Qian. An Improving Method for Social Media Text Normalization. Journal of Chinese Information Processing. 2015, 29(5): 104-112


Funding

Zhejiang Provincial Natural Science Foundation (LY12F02010); Sichuan Provincial Science Support Project (2014GZ0063)