SHI Jialu, LUO Xinyu, YANG Liner, XIAO Dan, HU Zhengsheng, WANG Yijun, YUAN Jiaxin, YU Jingsi, YANG Erhong. Construction of a Treebank of Learners Chinese (汉语学习者依存句法树库构建)[J]. Journal of Chinese Information Processing (中文信息学报), 2022, 36(1): 39-46.
SHI Jialu1,2,3, LUO Xinyu1,2,3, YANG Liner1,2,3, XIAO Dan1,2,3, HU Zhengsheng1,2,3, WANG Yijun1,2, YUAN Jiaxin1,2, YU Jingsi1,2, YANG Erhong1,3
1. National Language Monitoring and Research Center (CNLR) Print Media Language Branch, Beijing Language and Culture University, Beijing 100083, China; 2. School of Information Science, Beijing Language and Culture University, Beijing 100083, China; 3. Advanced Innovation Center for Language Resources, Beijing Language and Culture University, Beijing 100083, China
Abstract: A dependency treebank of learner Chinese provides dependency parses for non-native sentences. Such a resource can promote teaching and research on Chinese as a second language, and it can support related research such as syntactic analysis of learner language and grammatical error correction. However, few dependency treebanks of learner Chinese are available, and existing annotation guidelines still have some problems. In this paper, we develop an annotation guideline, establish an online annotation platform, and build the Treebank of Learner Chinese. We also describe the details of data selection and the annotation workflow, evaluate the quality of annotation, and explore the impact of errors on annotation quality and syntactic analysis.