Alignment and Annotation of Chinese-English Discourse Structure Parallel Corpus
FENG Wenhe
Department of Chinese Language and Literature, Henan Institute of Science and Technology, Xinxiang, Henan 453003, China; School of Computer, Wuhan University, Wuhan, Hubei 430072, China
Abstract: A discourse structure parallel corpus is a bilingual corpus annotated with parallel discourse structure information. This paper proposes an alignment and annotation strategy, structural and relational alignment, that serves as the theoretical basis for a Chinese-English discourse structure parallel corpus. The strategy is applied to corpus building through segmental, structural, relational, and central alignment, yielding a working mode in which alignment and annotation proceed together, as do unit alignment and structural alignment. Supported by corresponding annotation software and by solutions to the difficulties encountered, the strategy has proved to be an effective working mode for building a discourse structure parallel corpus.
Key words: parallel corpus; alignment; discourse structure