Exploiting Distributed Representation of Words for Better Off-Topic Essay Detection

CHEN Zhipeng, CHEN Wenliang, ZHU Muhua

Journal of Chinese Information Processing, 2015, Vol. 29, Issue 5: 178-185.
Section: Natural Language Processing Applications



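Abstract (translated from the Chinese)

The core problem in off-topic essay detection is computing text similarity. Traditional text similarity methods are generally based on the vector space model: each text is represented as a high-dimensional vector, and similarity is computed between those vectors. This approach considers only the terms that appear in the text (the bag-of-words model) and does not use the semantic information of those terms. This paper proposes a new text similarity method based on word expansion, which combines the bag-of-words approach with distributed representations of words. It searches the distributed representation vector space for words that are semantically similar to the terms appearing in a text and adds them to the text representation, thereby expanding the words in the text. Similarity is then computed on the expanded texts. The paper applies this method to off-topic detection for English essays, builds an off-topic detection system, and tests it on a real-world dataset. The experimental results show that the system identifies off-topic essays effectively and performs clearly better than the baseline systems.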

Abstract

Similarity measurement is the core component of off-topic essay detection. Text similarity is commonly computed with the bag-of-words model, which represents a text as a vector in which each dimension corresponds to a word. To also capture word-level semantic information, this paper proposes a new text similarity method that exploits distributed word representations. The proposed method combines the traditional bag-of-words model with word semantic information: for each word in a text, we search for a set of similar words in a text collection and extend the text vector with these words. Finally, we compute text similarity on the extended text. Experimental results show that our method is more effective than the baseline systems.
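
The word-expansion idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two-dimensional vectors in `TOY_VECTORS` are made-up values standing in for learned distributed representations, and the parameters `k`, `min_sim`, and `weight` (how many neighbors to add, how similar they must be, and how much an added word counts) are hypothetical choices that the abstract does not specify.

```python
import math
from collections import Counter

# Toy 2-dimensional word vectors; hypothetical values for illustration only.
TOY_VECTORS = {
    "car": (1.0, 0.1),
    "automobile": (0.9, 0.2),
    "vehicle": (0.8, 0.3),
    "banana": (0.1, 1.0),
    "fruit": (0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(word, vectors, k=2, min_sim=0.9):
    """Return up to k words whose vectors are closest to `word` (assumed cutoff min_sim)."""
    if word not in vectors:
        return []
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for sim, other in scored[:k] if sim >= min_sim]


def expand(tokens, vectors, k=2, min_sim=0.9, weight=0.5):
    """Build a bag of words, adding each token's neighbors with a reduced weight."""
    bag = Counter()
    for tok in tokens:
        bag[tok] += 1.0
        for neighbor in nearest_words(tok, vectors, k, min_sim):
            bag[neighbor] += weight
    return bag


def bag_cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(
        sum(x * x for x in b.values())
    )
    return dot / norm if norm else 0.0


prompt = ["car"]
on_topic_essay = ["automobile"]
off_topic_essay = ["banana"]

# Plain bag-of-words: no shared terms, so the on-topic essay looks unrelated.
plain = bag_cosine(Counter(prompt), Counter(on_topic_essay))

# After expansion, the prompt and the on-topic essay share terms.
expanded_on = bag_cosine(expand(prompt, TOY_VECTORS), expand(on_topic_essay, TOY_VECTORS))
expanded_off = bag_cosine(expand(prompt, TOY_VECTORS), expand(off_topic_essay, TOY_VECTORS))
```

In this toy setting the plain bag-of-words score between the prompt and the on-topic essay is zero because they share no surface terms, while the expanded representations overlap; the unrelated essay stays at zero. A real system would presumably compare such scores against a threshold to flag off-topic essays, but how that decision is made is not stated in the abstract.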


Keywords

text similarity / distributed word representation / off-topic detection / text representation

Cite this article

CHEN Zhipeng, CHEN Wenliang, ZHU Muhua. Exploiting Distributed Representation of Words for Better Off-Topic Essay Detection. Journal of Chinese Information Processing. 2015, 29(5): 178-185

Funding

National Natural Science Foundation of China (61203314, 61333018)