A Survey on the Automatic Text Readability Measures
WU Siyuan1,2, CAI Jianyong2,3, YU Dong1, JIANG Xin2
1. College of Information Science, Beijing Language and Culture University, Beijing 100083, China; 2. Center for Studies of Chinese as a Second Language, Beijing Language and Culture University, Beijing 100083, China; 3. College of Intensive Chinese Studies, Beijing Language and Culture University, Beijing 100083, China
Abstract: The concept of readability was originally proposed by educators to help select suitable reading materials for learners. This paper surveys existing work on automatic text readability measures and groups the methods into three types: formula-based methods, classification methods, and ranking methods. It also outlines the databases and the extracted features used in the literature. Finally, it discusses future directions for automatic readability research.