Off-topic Essays Detection Based on Document Divergence
CHEN Zhipeng1,2, CHEN Wenliang1,2
1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China; 2. Collaborative Innovation Center of Novel Software Technology and Industrialization, Suzhou, Jiangsu 215006, China
Abstract: Off-topic detection is an important component of automated essay scoring systems. Traditional methods compute the similarity between essays and then compare that similarity with a fixed threshold to decide whether an essay is off-topic. In practice, the essay score depends heavily on the type of topic; for example, the range of essay scores for a divergent topic differs greatly from that for a non-divergent topic. As a result, a single fixed threshold cannot identify off-topic essays across all topics. This paper proposes a new method of off-topic detection based on the divergence of essays. We study the divergence of essays and establish a linear regression model between divergence and threshold, so that the method uses a dynamic threshold for each topic. Experimental results show that our method is more effective than the baseline systems.
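The idea of a per-topic dynamic threshold can be illustrated with a minimal sketch. It assumes that each training topic already has a divergence value and a hand-tuned similarity threshold, fits a least-squares line between the two, and uses that line to predict a threshold for a new topic. The function names, the numbers, and the convention that an essay is off-topic when its similarity falls below the threshold are all illustrative assumptions, not details taken from the paper; how divergence itself is measured is left outside the sketch.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def dynamic_threshold(divergence: float, slope: float, intercept: float) -> float:
    """Predict a topic-specific threshold from the topic's divergence."""
    return slope * divergence + intercept


def is_off_topic(similarity: float, threshold: float) -> bool:
    """Assumed convention: similarity below the threshold means off-topic."""
    return similarity < threshold


# Hypothetical training topics: (divergence, hand-tuned threshold).
# These numbers are made up for illustration only.
train_divergence = [0.1, 0.2, 0.3, 0.4]
train_threshold = [0.50, 0.45, 0.40, 0.35]

slope, intercept = fit_line(train_divergence, train_threshold)

# A new topic with a (hypothetical) measured divergence of 0.25.
threshold = dynamic_threshold(0.25, slope, intercept)
print(is_off_topic(0.30, threshold), is_off_topic(0.60, threshold))
```

In this toy data a more divergent topic receives a lower threshold, so loosely related essays on an open-ended topic are less likely to be flagged; whether the fitted relation in the paper has this direction is not stated in the abstract.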