A Self-Supervised Learning Method for Euphemism Identification

HU Yuxue, WU Mingmin, SHA Ying, ZENG Zhi, ZHANG Yuqi

Journal of Chinese Information Processing ›› 2023, Vol. 37 ›› Issue (10): 55-63, 75.
Information Extraction and Text Mining



Abstract

Euphemisms are commonly used on social media and darknet marketplaces to evade platform regulations by masking their true meanings with innocent ones. For instance, “weed” is used instead of “marijuana” for illicit transactions. The task of euphemism identification is to map a given euphemism to its specific target word. Despite its significance, euphemism identification research has not received much attention, partly due to the lack of annotated datasets and the current methods primarily focusing on individual words rather than the context of euphemistic expressions. To address these issues, this paper proposes a double self-supervised learning model, DSLM, for euphemism identification. The outer self-supervised learning framework is used to automatically construct a labeled dataset to tackle the problem of insufficient annotated data. And the inner framework utilizes a contrastive learning approach with the context of euphemisms to narrow the semantic distance between the euphemistic context representation and the target word. The experiments demonstrate that the proposed approach outperforms the state-of-the-art methods, with more stable results and faster convergence of the model.
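The inner contrastive step described above — pulling the euphemism's context representation toward the embedding of its true target word while pushing it away from other candidates — can be sketched as an InfoNCE-style objective. This is a minimal illustration only, not the paper's actual implementation: the function name, vector shapes, and temperature value are assumptions, and the paper operates over learned BERT-style context representations rather than raw vectors.

```python
import numpy as np

def contrastive_loss(context_vec, target_vecs, true_idx, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch).

    context_vec : embedding of the euphemism's context, e.g. of
                  "I bought some [MASK] yesterday" (shape: d).
    target_vecs : embeddings of all candidate target words, e.g.
                  "marijuana", "heroin", ... (shape: n x d).
    true_idx    : index of the true target word for this euphemism.
    temperature : softmax temperature (0.07 is a common choice in
                  contrastive learning, assumed here, not from the paper).
    """
    # Cosine similarity between the context and each candidate target.
    c = context_vec / np.linalg.norm(context_vec)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = (t @ c) / temperature

    # Softmax cross-entropy against the true target: minimizing this
    # narrows the semantic distance to the true target relative to
    # the other candidates.
    logits = sims - sims.max()  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[true_idx])
```

A context representation aligned with its true target yields a lower loss than a mismatched one, which is exactly the gradient signal that moves euphemistic contexts toward their target-word meanings during training.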


Key words

euphemism identification / self-supervised learning / contrastive learning

Cite this article
HU Yuxue, WU Mingmin, SHA Ying, ZENG Zhi, ZHANG Yuqi. A Self-Supervised Learning Method for Euphemism Identification. Journal of Chinese Information Processing. 2023, 37(10): 55-63,75


Funding

Fundamental Research Funds for the Central Universities (2662021JC008); National Natural Science Foundation of China (62272188); National Social Science Fund of China (19BSH022); Science and Technology Major Project of the Inner Mongolia Autonomous Region (2021ZD0046)