Abstract
Off-topic detection is an important module of automated essay scoring systems. Traditional methods compute a content-relevance (similarity) score for each essay and compare it with a fixed threshold to decide whether the essay is off-topic. In practice, however, the score depends heavily on the topic: essays written on divergent topics score very differently from essays written on non-divergent topics, so a single fixed threshold cannot reliably identify off-topic essays across all topics. This paper proposes an off-topic detection method based on document divergence. We study the divergence of a set of essays, establish a linear regression model between divergence and the off-topic threshold, and use it to select a threshold dynamically for each topic. We build an off-topic detection system and test it on a real-world data set. Experimental results show that the divergence-based system identifies off-topic essays effectively and is more effective than the baseline systems.
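The dynamic-threshold idea in the abstract can be illustrated with a minimal sketch. The abstract does not define divergence or give the regression details, so everything below is an assumption made for illustration: divergence is taken to be one minus the mean pairwise cosine similarity of bag-of-words vectors, the line is fitted by ordinary least squares, and the names `divergence`, `fit_line` and `is_off_topic` are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def divergence(essays):
    """ASSUMED definition: 1 - mean pairwise cosine similarity of the essays.

    The paper's actual definition of document divergence may differ.
    """
    vectors = [Counter(text.lower().split()) for text in essays]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(a, b) for a, b in pairs) / len(pairs)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def is_off_topic(score, topic_divergence, slope, intercept):
    """Flag an essay whose relevance score is below the topic's own threshold."""
    threshold = slope * topic_divergence + intercept
    return score < threshold


# Hypothetical training data: one (divergence, best threshold) pair per topic.
slope, intercept = fit_line([0.0, 1.0], [0.5, 0.3])

# The same relevance score is judged differently on a focused topic
# (divergence 0.0) and on a highly divergent topic (divergence 1.0).
focused = is_off_topic(0.35, 0.0, slope, intercept)
divergent = is_off_topic(0.35, 1.0, slope, intercept)
```

With these made-up numbers the fitted threshold falls as divergence rises, so a score that is flagged under a focused topic is accepted under a divergent one. This is the behaviour a single fixed threshold cannot reproduce.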
Key words
off-topic detection /
document divergence /
document similarity
Funding
National Natural Science Foundation of China (61572338)