Uyghur Recognition in Webpages and the Lower Bound of Text Length for Language Discrimination

NI Yaoqun1,2,3, CAO Peng1,2, XU Hongbo1, TANG Huifeng3, CHENG Xueqi1

Journal of Chinese Information Processing, 2012, Vol. 26, Issue 6: 109-116.
Review


Uyghur Recognition in Webpages and the Lower Bound of Text Length for Language Discrimination

  • NI Yaoqun1,2,3, CAO Peng1,2, XU Hongbo1, TANG Huifeng3, CHENG Xueqi1

Abstract

Distinguishing Uyghur from similar languages written in Arabic-based scripts, such as Arabic, Kazakh, and Kyrgyz, is a prerequisite for Uyghur information processing. This paper builds an N-gram language discrimination model over an optimized encoding of Uyghur characters, achieving fast language identification with an accuracy above 98%. Error analysis shows that the misclassified texts are concentrated in forum posts and microblogs, which contain too few valid characters (often only a few words) to carry sufficient language features. Finally, the paper computes all common substrings among tokens from real web texts in the four languages and analyzes the minimum string length required to determine the language of a text.
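The N-gram discrimination idea in the abstract can be illustrated with a minimal sketch. It assumes character trigrams with add-one smoothing and picks the language whose model gives the highest log-likelihood; the paper's actual model order, smoothing, and optimized character encoding are not given on this page, so none of them are reproduced here. The class `NGramModel`, the function `classify`, and the Latin-letter training strings are hypothetical placeholders, not data from the paper.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NGramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = Counter()          # n-gram counts
        self.context_counts = Counter()  # counts of the (n-1)-character prefix
        self.vocab = set()

    def train(self, text):
        for gram in ngrams(text, self.n):
            self.counts[gram] += 1
            self.context_counts[gram[:-1]] += 1
        self.vocab.update(text)

    def log_prob(self, text, vocab_size):
        total = 0.0
        for gram in ngrams(text, self.n):
            numerator = self.counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, models):
    """Return the label of the model that scores text highest."""
    vocab = set(text)
    for model in models.values():
        vocab |= model.vocab
    return max(models, key=lambda lang: models[lang].log_prob(text, len(vocab)))


# Toy placeholder "languages"; real use would train on web text per language.
training = {
    "lang_a": "abab abba baba abab",
    "lang_b": "cdcd dccd cdcd dcdc",
}
models = {}
for label, sample in training.items():
    models[label] = NGramModel(n=3)
    models[label].train(sample)

print(classify("abba", models))
print(classify("dccd", models))
```

With very short inputs only a handful of n-grams contribute to each score, which is consistent with the abstract's observation that errors cluster in short forum posts and microblogs.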
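The common-substring analysis mentioned in the abstract can be sketched in the same spirit. The brute-force version below enumerates every substring of every token for two languages, intersects the two sets, and reports how many shared substrings exist at each length. Under the assumption that a string longer than the longest shared substring cannot occur in both samples, that length plus one is a rough lower bound for discrimination on the sample. The function names and toy tokens are hypothetical, and the paper's own procedure may differ.

```python
from collections import Counter


def substrings(tokens):
    """Return the set of all non-empty substrings of the given tokens."""
    subs = set()
    for token in tokens:
        for start in range(len(token)):
            for end in range(start + 1, len(token) + 1):
                subs.add(token[start:end])
    return subs


def common_substrings(tokens_a, tokens_b):
    """Substrings that occur in at least one token of each language."""
    return substrings(tokens_a) & substrings(tokens_b)


def shared_count_by_length(common):
    """Map substring length to the number of shared substrings of that length."""
    counts = Counter(len(s) for s in common)
    return dict(sorted(counts.items()))


# Toy placeholder tokens; real use would take tokens from web text.
tokens_a = ["abcde", "xbcd"]
tokens_b = ["zbcdq", "cde"]

common = common_substrings(tokens_a, tokens_b)
by_length = shared_count_by_length(common)
longest_shared = max(by_length) if by_length else 0

print(by_length)           # {1: 4, 2: 3, 3: 2}
print(longest_shared + 1)  # 4
```

Enumerating all substrings is quadratic in token length, which is acceptable for word-sized tokens; a suffix automaton or suffix array would be a natural replacement for long texts.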

Key words

Arabic-Script Uyghur / language detection / longest common substring

Cite this article

NI Yaoqun, CAO Peng, XU Hongbo, TANG Huifeng, CHENG Xueqi. Uyghur Recognition in Webpages and the Lower Bound of Text Length for Language Discrimination. Journal of Chinese Information Processing. 2012, 26(6): 109-116


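Funding

Supported by the National Natural Science Foundation of China (60903139, 60873243); the Key Program of the National Natural Science Foundation of China (60933005); and the Key Program of the National 863 Program of China (2010AA012502, 2010AA012503).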