A Survey on the Automatic Text Readability Measures

WU Siyuan, CAI Jianyong, YU Dong, JIANG Xin

Journal of Chinese Information Processing, 2018, Vol. 32, Issue 12: 1-10.
Survey


Abstract

The problem of text readability was first raised by educators, with the original aim of helping teachers recommend texts suited to the reading level of language learners. With advances in computing and the proliferation of web text, readability analysis now has a richer set of techniques and application scenarios. This paper surveys existing work on automatic readability analysis and groups the methods into three types: formula-based, classification-based, and ranking-based. It then covers two important components of automatic readability analysis: the selection of text features and the use of datasets. Finally, it discusses future directions for readability research.

Key words

text readability / readability analysis / feature selection

Cite this article

WU Siyuan, CAI Jianyong, YU Dong, JIANG Xin. A Survey on the Automatic Text Readability Measures. Journal of Chinese Information Processing. 2018, 32(12): 1-10


Funding

National Social Science Fund of China (17ZDA305); Fundamental Research Funds for the Central Universities (17PT05)