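Using cross-language categorization to share information resources in different languages on the Internet is an important approach to knowledge mining. This paper presents a model for bilingual cross-language text categorization and a method for implementing it. The main idea is that neither machine translation nor manual annotation is required. A text feature extraction mechanism extracts class feature terms and text feature terms; translation mapping rules based on concept expansion automatically generate the class and text feature vectors; on this basis, latent semantic analysis unifies the bilingual texts at the semantic level; and texts are classified by the semantic similarity between classes and texts. The approach thereby achieves relatively high precision.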
Abstract
Multilingual text categorization is essential to knowledge discovery because it allows information sources on the Internet to be shared across languages. This paper presents a model for bilingual text categorization. The model uses a text feature extraction mechanism to extract the features of classes and texts, and it generates the class and text feature vectors through word translation rules based on concept expansion. It then applies Latent Semantic Indexing to integrate the bilingual texts at the semantic level, and it classifies texts by computing the semantic similarity between texts and classes. The model achieves high categorization precision and is independent of machine translation and manual tagging.
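The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary `TRANSLATION`, the vocabulary `VOCAB`, the training documents, the class definitions, and all function names are hypothetical, and concept expansion is approximated here by simply mapping each Chinese term to several related English terms with equal weight. The sketch builds feature vectors, projects them into a latent semantic space obtained by a truncated SVD, and assigns a text to the class with the highest cosine similarity.

```python
import numpy as np

# Hypothetical dictionary: each Chinese term maps to several English terms,
# standing in for translation rules with concept expansion.
TRANSLATION = {
    "股票": ["stock", "share"],
    "市场": ["market"],
    "足球": ["football", "soccer"],
    "比赛": ["match", "game"],
}
VOCAB = ["stock", "share", "market", "football", "soccer", "match", "game"]
INDEX = {term: i for i, term in enumerate(VOCAB)}


def to_vector(terms, translate=False):
    """Build a feature vector over VOCAB; optionally map terms through TRANSLATION."""
    vec = np.zeros(len(VOCAB))
    for term in terms:
        targets = TRANSLATION.get(term, []) if translate else [term]
        for target in targets:
            if target in INDEX:
                # Spread the weight of one source term over its translations.
                vec[INDEX[target]] += 1.0 / len(targets)
    return vec


def fit_lsi(doc_vectors, k):
    """Return the top-k left singular vectors of the term-by-document matrix."""
    matrix = np.column_stack(doc_vectors)
    u, _, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k]


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def classify(text_vec, class_vecs, u_k):
    """Compare the text with every class in the latent space; return the best label."""
    z = u_k.T @ text_vec
    scores = {name: cosine(z, u_k.T @ vec) for name, vec in class_vecs.items()}
    return max(scores, key=scores.get), scores


# Toy English training documents (assumed data) used to build the latent space.
TRAIN_DOCS = [
    ["stock", "market"], ["share", "market"], ["stock", "share"],
    ["football", "match"], ["soccer", "game"],
    ["football", "soccer"], ["match", "game"],
]
U_K = fit_lsi([to_vector(doc) for doc in TRAIN_DOCS], k=2)

# Class feature vectors, here defined directly from English feature terms.
CLASS_VECS = {
    "finance": to_vector(["stock", "share", "market"]),
    "sports": to_vector(["football", "soccer", "match", "game"]),
}

# A Chinese text is mapped into the English feature space, then classified.
label, scores = classify(to_vector(["股票", "市场"], translate=True), CLASS_VECS, U_K)
print(label, scores)
```

The cosine scores do not depend on which orthonormal basis the SVD returns for the retained subspace, so the comparison is stable even when singular values coincide. In a realistic setting the term weights, the dictionary, and the number of latent dimensions `k` would all have to be chosen from data.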
Keywords
bilingual cross-language text categorization /
concept expansion /
latent semantic analysis /
vector space model
Key words
bilingual text categorization /
conceptual expansion /
latent semantic indexing /
vector space model