MA Zhiqiang, ZHANG Zeguang, YAN Rui, LIU Limin, FENG Yongxiang, SU Yila. N-Gram Based Language Identification for Mongolian Text[J]. Journal of Chinese Information Processing, 2016, 30(1): 133-140.
N-Gram Based Language Identification for Mongolian Text
MA Zhiqiang, ZHANG Zeguang, YAN Rui, LIU Limin, FENG Yongxiang, SU Yila
School of Information Engineering, Inner Mongolia University of Technology, Hohhot, Inner Mongolia 010080, China
Abstract: With the rapid increase of Mongolian texts on the Internet, identifying them before further processing is of practical significance. This paper proposes an average-distance recognition algorithm based on the N-Gram model and establishes an experimental platform. Experimental results show that the algorithm can distinguish Mongolian text from Chinese, English, or even mixed-language texts with an accuracy above 99.5%.
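The abstract does not spell out how the average distance is computed, so the following is only a minimal sketch of one plausible reading: character n-gram profiles are ranked by frequency, compared with the out-of-place rank distance of Cavnar and Trenkle, and the total is divided by the number of n-grams in the document. The function names, the `top_k` and `n_max` values, and the toy training samples (including the use of Cyrillic Mongolian) are all assumptions made for illustration, not the authors' implementation or data.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (1..n_max) to its rank."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then by the n-gram itself so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def average_distance(doc_profile, lang_profile, max_penalty=300):
    """Mean out-of-place distance; n-grams missing from lang_profile cost max_penalty."""
    if not doc_profile:
        return float("inf")
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total / len(doc_profile)


def identify(text, lang_profiles):
    """Return the language whose profile has the smallest average distance to text."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: average_distance(doc, lang_profiles[lang]))


# Hypothetical toy training samples; a real system would train on large corpora.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "mn": "монгол хэл бол монгол үндэстний хэл юм монгол улсын албан ёсны хэл",
}
profiles = {lang: ngram_profile(sample) for lang, sample in samples.items()}

print(identify("the cat sat on the mat", profiles))
print(identify("монгол хэл", profiles))
```

Dividing by the number of document n-grams keeps the score comparable across texts of different lengths, which matters when short and long documents are classified with the same threshold.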