Event Pre-training for Chinese-Vietnamese Cross-lingual Event Retrieval

WU Shaoyang, YU Zhengtao, HUANG Yuxin, ZHU Enchang, GAO Shengxiang, DENG Tongjie

Journal of Chinese Information Processing, 2024, Vol. 38, Issue 4: 78-85.
Column: Information Processing of Ethnic, Cross-border, and Neighboring Languages


Abstract

Chinese-Vietnamese cross-lingual event retrieval is the task of retrieving Vietnamese event news with Chinese queries. Because Vietnamese is a typical low-resource language, the task lacks large-scale annotated data, and existing cross-lingual pre-trained models cannot adequately represent the rich Chinese-Vietnamese aligned event knowledge in text, making them ill-suited to this task. To incorporate Chinese-Vietnamese aligned event knowledge into a multilingual pre-trained language model, this paper proposes two pre-training methods: event-argument masked pre-training and cross-lingual event contrastive pre-training. Experiments on a Chinese-Vietnamese cross-lingual event retrieval dataset constructed in this paper and on a public cross-lingual question-answering dataset show improvements of 1%~3% in MAP and 2%~4% in NDCG over the baselines, demonstrating the effectiveness of the proposed methods.
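The cross-lingual event contrastive pre-training objective named above can be illustrated with a standard in-batch InfoNCE loss over paired Chinese/Vietnamese event embeddings. This is a minimal sketch of the general technique, not the paper's implementation; the embedding dimensions and the temperature value are hypothetical.

```python
import numpy as np

def info_nce_loss(zh_emb, vi_emb, temperature=0.05):
    """In-batch contrastive loss: each Chinese event embedding should be
    most similar to its aligned Vietnamese embedding (the diagonal of the
    similarity matrix), with all other in-batch embeddings as negatives."""
    # L2-normalize so dot products become cosine similarities
    zh = zh_emb / np.linalg.norm(zh_emb, axis=1, keepdims=True)
    vi = vi_emb / np.linalg.norm(vi_emb, axis=1, keepdims=True)
    logits = zh @ vi.T / temperature          # (batch, batch) similarities
    # cross-entropy with the diagonal (aligned pairs) as gold labels
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls aligned Chinese-Vietnamese event pairs together in the shared representation space while pushing apart unaligned pairs, which is the general mechanism contrastive pre-training relies on.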

Key words

event pre-training / cross-lingual event retrieval / masked language model / contrastive learning
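For reference, the two ranking measures reported in the abstract, MAP (mean of per-query average precision) and NDCG, can be sketched as follows for a single query with binary relevance judgments. These helper functions are illustrative, not the paper's evaluation code.

```python
import math

def average_precision(ranked_rels):
    """ranked_rels: 0/1 relevance of retrieved docs, in rank order.
    Averages precision at each rank where a relevant doc appears."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def ndcg(ranked_rels, k=None):
    """Normalized discounted cumulative gain: DCG of the actual ranking
    divided by the DCG of the ideal (relevance-sorted) ranking."""
    rels = ranked_rels[:k] if k else ranked_rels
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = sorted(ranked_rels, reverse=True)[:len(rels)]
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

MAP rewards placing relevant documents early in the ranking, while NDCG additionally discounts gains logarithmically by rank, which is why the two metrics are commonly reported together in retrieval evaluation.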

Cite this article

WU Shaoyang, YU Zhengtao, HUANG Yuxin, ZHU Enchang, GAO Shengxiang, DENG Tongjie. Event Pre-training for Chinese-Vietnamese Cross-lingual Event Retrieval. Journal of Chinese Information Processing, 2024, 38(4): 78-85.


Funding

National Natural Science Foundation of China (U21B2027, 61972186, 61732005, 61866019); Yunnan Provincial Major Science and Technology Special Program (202002AD080001, 202202AD080003, 202103AA080015); Yunnan Provincial High-tech Industry Special Program (201606)