Abstract
The goal of the Chinese text readability grading task is to classify Chinese texts into difficulty levels appropriate for their readers. Recent studies have shown that linguistic features and deep semantic features are complementary in characterizing text difficulty. However, existing work has only performed shallow fusion of the two feature types; deep, multi-level fusion of linguistic features with deep models has not yet been explored. This paper therefore builds on a conventional BERT-based text readability grading model and designs a multi-level linguistic feature fusion method: taking into account how different linguistic features interact with different network layers, it fuses character-, word-, and grammar-level linguistic features into both the embedding layer and the self-attention layer of the model. Experimental results show that the proposed method outperforms all baseline models on the Chinese text readability grading task, reaching 94.2% accuracy on the test set.
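The abstract describes injecting character-, word-, and grammar-level features into both BERT's embedding layer and its self-attention layer, but does not spell out the fusion mechanism. Below is a minimal PyTorch sketch of one plausible realization, not the authors' implementation: the per-token feature dimension (feat_dim=64) is hypothetical, the module names FeatureFusedEmbedding and FeatureBiasedSelfAttention are illustrative, and the vocabulary size 21128 is that of the standard bert-base-chinese checkpoint.

```python
# A minimal sketch of multi-level linguistic feature fusion with a
# BERT-style encoder. Assumes each token comes with a hand-crafted
# linguistic feature vector (character/word/grammar statistics).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusedEmbedding(nn.Module):
    """Embedding-layer fusion: project linguistic features into the
    hidden space and add them to the token embeddings."""
    def __init__(self, vocab_size=21128, hidden=768, feat_dim=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.feat_proj = nn.Linear(feat_dim, hidden)

    def forward(self, input_ids, ling_feats):
        # ling_feats: (batch, seq_len, feat_dim) per-token features
        return self.tok(input_ids) + self.feat_proj(ling_feats)

class FeatureBiasedSelfAttention(nn.Module):
    """Attention-layer fusion: add a per-head bias derived from the
    linguistic features to the raw attention scores."""
    def __init__(self, hidden=768, heads=12, feat_dim=64):
        super().__init__()
        self.heads, self.d = heads, hidden // heads
        self.qkv = nn.Linear(hidden, hidden * 3)
        self.feat_score = nn.Linear(feat_dim, heads)  # one bias per head
        self.out = nn.Linear(hidden, hidden)

    def forward(self, x, ling_feats):
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.heads, self.d).transpose(1, 2)
        k = k.view(b, n, self.heads, self.d).transpose(1, 2)
        v = v.view(b, n, self.heads, self.d).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5      # (b, h, n, n)
        # Additive bias from each key token's linguistic features,
        # broadcast over the query dimension.
        bias = self.feat_score(ling_feats).permute(0, 2, 1)   # (b, h, n)
        attn = F.softmax(scores + bias.unsqueeze(2), dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(ctx)

# Toy usage with random inputs (batch=2, seq_len=8):
ids = torch.randint(0, 21128, (2, 8))
feats = torch.randn(2, 8, 64)
emb = FeatureFusedEmbedding()(ids, feats)
print(FeatureBiasedSelfAttention()(emb, feats).shape)  # torch.Size([2, 8, 768])
```

Additive fusion (projected features added to token embeddings, plus a per-head additive bias on attention scores) is only one design choice; gating or concatenation followed by a projection would fit the same interfaces.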
Keywords
Chinese text readability grading /
multi-level linguistic feature fusion /
deep model
Funding
National Natural Science Foundation of China (62137001); Key Project of the Center for Language Education and Cooperation, Ministry of Education (21YH21B); Key Project of Teaching Resource Construction (YHJC22ZD067); East China Normal University Special Projects on New Chinese Language Education (2022ECNU-WHCCYJ-29, 2022ECNU-WHCCYJ-31)