Abstract
Tibetan-Chinese (Ti-Zh) machine translation in the judicial domain suffers from severe data sparsity. Building a high-quality judicial Ti-Zh corpus is obstructed by two issues. First, compared with the general domain, judicial Tibetan requires more rigorous logical expression and far more professional terminology, yet existing Tibetan resources in this domain lack the corresponding corpora, terminology, and syntactic structures. Second, the unique lexical expressions and specific syntactic structures of Tibetan make it difficult for general-purpose corpus construction methods to build a Ti-Zh parallel corpus. This paper therefore proposes a lightweight construction method for a judicial-domain Ti-Zh parallel corpus. First, a medium-scale Ti-Zh judicial terminology glossary is obtained by manual annotation and used as a prior knowledge base, avoiding the logical-expression and missing-terminology problems caused by crossing domain boundaries. Second, case data such as judgment documents are collected from the official websites of local courts across China, giving priority to Tibetan case data over Chinese so that the special lexical expressions and sentence patterns of Tibetan are not lost when Tibetan sentences are constructed later. Following these principles, a high-quality Ti-Zh parallel corpus is built through web crawling, rule-based section alignment detection, sentence boundary detection, and automatic corpus cleaning. The resulting judicial-domain Ti-Zh corpus contains roughly 160,000 sentence pairs; its quality and robustness are verified with a variety of translation models and cross-validation experiments, and the corpus will be released as open source for related research.
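The abstract outlines a rule-based pipeline: web crawling, section alignment detection, sentence boundary detection, and automatic corpus cleaning. The Python sketch below illustrates only the last two steps under simple assumptions: Tibetan sentences are split on the shad (།), Chinese sentences on 。！？, and noisy pairs are filtered by length and length-ratio thresholds. The function names and threshold values are hypothetical choices for illustration, not the paper's exact rules, and the crawling, glossary lookup, and alignment stages are omitted.

    # Illustrative sketch of rule-based sentence boundary detection and pair cleaning.
    # Delimiters and thresholds are assumptions, not taken from the paper.
    _BOUNDARIES = {"bo": "།", "zh": "。！？"}   # Tibetan shad vs. Chinese sentence-final marks

    def split_sentences(text, lang):
        """Split text into sentences for lang 'bo' (Tibetan) or 'zh' (Chinese)."""
        marks = _BOUNDARIES[lang]
        sentences, buf = [], ""
        for ch in text:
            buf += ch
            if ch in marks:                 # close a sentence at every boundary mark
                sentences.append(buf.strip())
                buf = ""
        if buf.strip():                     # keep any trailing fragment
            sentences.append(buf.strip())
        return sentences

    def keep_pair(bo, zh, min_len=4, max_len=300, max_ratio=3.0):
        """Noise filter for an aligned pair: drop pairs whose lengths are
        implausible or whose character-length ratio is extreme."""
        lb, lz = len(bo), len(zh)
        if not (min_len <= lb <= max_len and min_len <= lz <= max_len):
            return False
        return max(lb, lz) / max(1, min(lb, lz)) <= max_ratio

    if __name__ == "__main__":
        zh_doc = "本院依法组成合议庭，公开开庭审理了本案。被告人到庭参加诉讼。"
        print(split_sentences(zh_doc, "zh"))

In the full method these helpers would run between crawling and alignment; the terminology glossary and section-level alignment checks described in the abstract are not shown here.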
Key words
judicial domain /
Tibetan-Chinese parallel corpus /
data sparsity
Funding
National Key Research and Development Program of China (2018YFC0832104); National Natural Science Foundation of China (61732005)