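The sign language and Chinese parallel corpus is built for machine translation and contrastive linguistic research, and to systematically preserve sign language resources and protect sign language and Deaf culture. The corpus stores sign language videos, information about the recorded signers and the annotators, and fourteen tiers of annotation transcribed with the multimedia annotation software ELAN, covering both manual and non-manual information. This paper applies a cosine similarity algorithm based on the vector space model to compute the similarity between sign language corpus entries for deduplication, with clearly noticeable results; the same algorithm is also used for an expert similarity test to ensure the quality of the corpus.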
Abstract
Building a parallel corpus of sign language and Chinese is significant for machine translation and contrastive language studies. The corpus presented in this paper consists of sign language videos, information about the recorded signers and the annotators, and 14 tiers of annotation produced with the multimedia annotation software ELAN, covering both manual and non-manual information. Cosine similarity based on the vector space model (VSM) is used to compute the similarity between corpus entries and thereby deduplicate the corpus. It is also used in an expert similarity test to ensure the quality of the corpus.
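The deduplication step can be illustrated with a minimal sketch. The abstract does not say how entries are vectorized or which similarity cutoff is used, so the following are assumptions made only for illustration: each entry is a list of gloss tokens, vectors are raw term frequencies, and the threshold of 0.9 and the sample glosses are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both entries contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty entry is treated as similar to nothing
    return dot / (norm_a * norm_b)


def deduplicate(entries, threshold=0.9):
    """Keep an entry only if it is below the threshold against every kept entry.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    kept = []
    for tokens in entries:
        if all(cosine_similarity(tokens, k) < threshold for k in kept):
            kept.append(tokens)
    return kept


# Hypothetical gloss sequences; the second repeats the first.
glosses = [
    ["I", "LIKE", "APPLE"],
    ["I", "LIKE", "APPLE"],
    ["HE", "GO", "SCHOOL"],
]
unique = deduplicate(glosses)
print(len(unique), "entries kept out of", len(glosses))
```

The same `cosine_similarity` function could in principle score one annotator's transcription against an expert's transcription of the same video, which is one possible reading of the expert similarity test mentioned above.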
Keywords
sign language /
parallel corpus /
transcription
Key words
sign language /
parallel corpus /
gloss
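Funding
Key Project of the State Language Commission (ZDI135-31); Key Project of Beijing Education Science Planning (ADA14121); Beijing Union University Graduate Funding Project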