Abstract
Distinguishing Uyghur from other languages written in Arabic-based scripts, such as Arabic, Kazakh, and Kyrgyz, is a basic task in Uyghur information processing. The authors optimize the encoding of Uyghur characters and then build an N-gram language model for fast Uyghur language identification, reaching an accuracy above 98%. Error analysis shows that the misclassified texts come mainly from forum posts and microblogs, which contain too few valid characters (often only a few words) to carry sufficient language features. Finally, the authors compute all common substrings of real web text in the four languages and analyze the minimum string length needed to identify the language.
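The abstract does not spell out the model, so the following is only a minimal sketch of character N-gram language discrimination, assuming bigrams with add-one smoothing. The function names (`char_ngrams`, `train`, `log_prob`, `classify`) and the two-language toy training data are hypothetical, and the paper's optimized character encoding is not reproduced here.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return overlapping character n-grams, padding with spaces at both ends."""
    padded = " " * (n - 1) + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=2):
    """Count n-grams and their (n-1)-character contexts for each language.

    `samples` maps a language label to a list of training strings.
    """
    models = {}
    for lang, texts in samples.items():
        grams, contexts = Counter(), Counter()
        for text in texts:
            for g in char_ngrams(text, n):
                grams[g] += 1
                contexts[g[:-1]] += 1
        models[lang] = (grams, contexts)
    return models


def log_prob(text, model, n, vocab_size):
    """Add-one smoothed log-probability of `text` under one language model."""
    grams, contexts = model
    return sum(
        math.log((grams[g] + 1) / (contexts[g[:-1]] + vocab_size))
        for g in char_ngrams(text, n)
    )


def classify(text, models, n=2):
    """Pick the language whose model gives `text` the highest log-probability."""
    vocab = {ch for grams, _ in models.values() for g in grams for ch in g}
    vocab.update(text)
    return max(models, key=lambda lang: log_prob(text, models[lang], n, len(vocab)))


# Toy training phrases (assumption: far too small for real use).
samples = {
    "uyghur": ["ئۇيغۇر تىلى", "ئۇيغۇر يېزىقى"],
    "arabic": ["اللغة العربية", "الكتابة العربية"],
}
models = train(samples)
print(classify("ئۇيغۇر", models))
print(classify("العربية", models))
```

With so little training data the score is driven almost entirely by letters that occur in only one of the two samples. This mirrors the abstract's observation that very short inputs carry too few features: the fewer n-grams a text contributes, the smaller the gap between the competing log-probabilities.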
Key words: Arabic-script Uyghur; language detection; longest common substring
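Funding
Supported by the National Natural Science Foundation of China (60903139, 60873243); a Key Program of the Natural Science Foundation (60933005); and key projects of the National 863 Program (2010AA012502, 2010AA012503).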