Lexical-Transfer-Based Cross-language Morphological Reuse

LIU Wuying, WANG Lin

Journal of Chinese Information Processing, 2023, Vol. 37, Issue 8: 18-24.
Machine Translation


Abstract

The scarcity of well-structured language resources prevents some natural language processing algorithms from achieving higher performance on low-resource languages. To address the problem of morphological reuse between two languages, this paper proposes an evaluation metric, the morphological transfer ratio, to estimate the transfer effect, and verifies the effectiveness of morphological reuse in two tasks for low-resource languages: language resource construction and semantic paraphrasing application. In the language resource construction experiment, we extract Indonesian multiword expressions from a Malay corpus and Malay multiword expressions from an Indonesian corpus. In the semantic paraphrasing application experiment, we train an Indonesian-Chinese neural machine translation model on an Indonesian-Chinese parallel sentence corpus enhanced with Malay resources, and a Malay-Chinese neural machine translation model on a Malay-Chinese parallel sentence corpus enhanced with Indonesian resources. The experimental results show that, owing to the morphological homology and similarity of languages in the same family, cross-language morphological reuse within a language family has strong transferability and computability.
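The abstract does not give the formula for the morphological transfer ratio, so the following is only an illustrative sketch under an assumed definition: the fraction of multiword expressions from one language's lexicon that also occur as contiguous token sequences in the other language's corpus. The function names `ngrams` and `transfer_ratio`, the whitespace tokenization, and the toy data are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of contiguous n-token sequences in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def transfer_ratio(expressions: Iterable[str], corpus: Iterable[str]) -> float:
    """Assumed definition (not the paper's): share of source-language
    multiword expressions found verbatim in the target-language corpus."""
    # Normalize each expression to a lowercase token tuple.
    exprs = {tuple(e.lower().split()) for e in expressions}
    exprs.discard(())  # ignore empty strings
    if not exprs:
        return 0.0
    lengths = {len(e) for e in exprs}
    seen: Set[Tuple[str, ...]] = set()
    for sentence in corpus:
        tokens = sentence.lower().split()  # naive whitespace tokenization
        for n in lengths:
            seen |= ngrams(tokens, n)
    return len(exprs & seen) / len(exprs)


# Toy data for illustration only.
indonesian_mwes = ["terima kasih", "rumah sakit", "kereta api"]
malay_corpus = [
    "Terima kasih kerana datang",
    "Kami menaiki kereta api ke bandar",
]
ratio = transfer_ratio(indonesian_mwes, malay_corpus)
print(f"{ratio:.2f}")
```

On this toy input, two of the three expressions appear in the corpus. A real evaluation would need proper tokenization and the metric as actually defined in the paper.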

Keywords

morphological reuse / morphological transfer ratio / low-resource language / multiword expression extraction / neural machine translation

Cite this article

LIU Wuying, WANG Lin. Lexical-Transfer-Based Cross-language Morphological Reuse. Journal of Chinese Information Processing. 2023, 37(8): 18-24


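Funding

Ministry of Education New Liberal Arts Research and Reform Practice Project (2021060049); Shandong Provincial Graduate Education and Teaching Reform Research Project (SDYJG21185); Key Project of Shandong Provincial Undergraduate Teaching Reform Research (Z2021323); Ministry of Education Humanities and Social Sciences Research Youth Fund Project (20YJC740062); Shanghai Philosophy and Social Sciences "13th Five-Year Plan" Project (2019BYY028); Ministry of Education Humanities and Social Sciences Research Planning Fund Project (20YJAZH069); Guangzhou Science and Technology Program Project (202201010061)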