Chinese Readability Assessment Based on Deep Neural Networks

LI Wenbiao, WU Yunfang

Journal of Chinese Information Processing, 2023, Vol. 37, Issue (2): 158-168.
Natural Language Processing Applications


Abstract

Readability assessment, a fundamental research topic for natural language processing in education, aims to automatically determine the reading difficulty of a given document. Focusing on Chinese, this paper explores deep neural network models for the task and proposes a CNN + LSTM difficulty classification model that adopts variable-length convolutional layers and a block structure suited to the characteristics of graded corpora. Detailed experiments on a textbook test set and a manually constructed test set show that the proposed model outperforms both traditional machine learning methods and mainstream neural network methods, reaching 75.4% accuracy on five-level data divided by school stage.
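The reported figure is a classification accuracy over five difficulty levels. As a minimal sketch of how such a score could be computed, the snippet below compares gold and predicted levels for a handful of documents. The function names `accuracy` and `confusion_counts` and the toy labels are assumptions made for illustration; they are not taken from the paper, and the paper's own evaluation code is not shown here.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of documents whose predicted level equals the gold level."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def confusion_counts(gold, predicted):
    """Count (gold, predicted) pairs to see which levels get confused."""
    return Counter(zip(gold, predicted))


# Toy labels on a hypothetical 1-5 scale; these are not data from the paper.
gold = [1, 2, 3, 4, 5, 3, 2, 5]
pred = [1, 2, 4, 4, 5, 3, 1, 5]

print(f"accuracy = {accuracy(gold, pred):.3f}")
for (g, p), n in sorted(confusion_counts(gold, pred).items()):
    if g != p:
        print(f"gold level {g} predicted as level {p}: {n} document(s)")
```

Listing the off-diagonal pairs alongside the accuracy shows whether errors fall between adjacent levels, which a single accuracy number does not reveal.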

Key words

readability assessment / feature extraction / deep learning

Cite this article

LI Wenbiao, WU Yunfang. Chinese Readability Assessment Based on Deep Neural Networks. Journal of Chinese Information Processing. 2023, 37(2): 158-168

Funding

National Natural Science Foundation of China (62076008, 61936012)