NCIFD: National Culture Instruction-Following Dataset for Large Language Models

LUO He, ZHANG Ting, SUN Yuan, PENGMAO Cairang, DAWA Cairen

Journal of Chinese Information Processing, 2025, Vol. 39, Issue (2): 41-51.
Language Resource Construction and Application


Abstract

Amid the rapid development of large language models, research on and dissemination of ethnic minority cultures call for greater investment. Building high-quality national culture datasets not only promotes the spread of these cultures but also improves the accuracy and adaptability of large language models in specific cultural environments. To construct a high-quality instruction-following dataset for national culture, this paper collects and curates 18 books on ethnic culture, including the Encyclopedia of Chinese Ethnic Groups (《中国民族百科全书》) and the Chinese Costume Canon (《中国服饰大典》). After cleaning and filtering the texts, question-answer pairs are automatically generated with a large language model under the Self-QA framework. In addition, 58 seed instructions on national culture are manually written from the books' contents; following the Self-Instruct framework, these seeds are used to prompt GPT-3.5 to automatically generate instruction, input, and output samples. The data obtained by the two methods are filtered in several ways to form the National Culture Instruction-Following Dataset (NCIFD). Fine-tuning experiments on mainstream open-source models such as ChatGLM-6B and LLaMA-2-7B show that the fine-tuned base models answer questions 6.6% more accurately on average than the corresponding chat-version models, verifying the dataset's validity and usability. NCIFD supports the fine-tuning of large models for the national culture domain and is of significance for advancing ethnic culture work in natural language processing. Part of the NCIFD resources is openly released for research at https://github.com/letsgoLakers/NCIFD.
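As a concrete illustration of the Self-QA generation step [26], the sketch below prompts a chat model for question-answer pairs over one cleaned passage. It is a minimal sketch assuming the OpenAI Python client; the model name, prompt wording, pair count, and the generate_qa_pairs helper are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative Self-QA-style sketch: ask an LLM for QA pairs grounded
# in one cleaned passage from a source book. Model, prompt, and pair
# count are assumptions; the paper's exact setup may differ.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Read the following passage about Chinese ethnic culture and write "
    "3 question-answer pairs grounded strictly in the passage. Return a "
    "JSON array of objects with 'question' and 'answer' keys.\n\n"
    "Passage:\n{passage}"
)

def generate_qa_pairs(passage: str) -> list[dict]:
    """Generate QA pairs for one passage (assumes valid JSON output)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0.7,
    )
    return json.loads(resp.choices[0].message.content)

pairs = generate_qa_pairs("The Mongolian Naadam fair is a long-standing "
                          "traditional festival on the grasslands...")
for p in pairs:
    print(p["question"], "->", p["answer"])
```

In practice the model's output would need a parsing fallback, and generated pairs would still pass through the filtering stage the paper describes.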

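The Self-Instruct expansion [23] can be sketched in a similarly hedged way: sample a few seeds as in-context demonstrations, ask GPT-3.5 for a new instruction, and keep it only if it is sufficiently novel. The seed texts below are invented placeholders rather than the paper's 58 seeds, and difflib's similarity ratio stands in for the ROUGE-L novelty filter of the original Self-Instruct pipeline.

```python
# Illustrative Self-Instruct-style sketch: expand a seed instruction
# pool with GPT-3.5 plus a simple novelty filter. Seeds, prompt, and
# threshold are placeholders, not the paper's actual values.
import difflib
import random
from openai import OpenAI

client = OpenAI()

seed_pool = [  # invented placeholder seeds
    "Describe a traditional festival of one Chinese ethnic minority.",
    "List three kinds of traditional ethnic minority costumes.",
    "Explain the origin of the Mongolian Naadam fair.",
]

def expand_once(pool: list[str]) -> str | None:
    """Generate one candidate instruction; keep it only if novel."""
    demos = "\n".join(f"- {s}" for s in random.sample(pool, 3))
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Here are example instructions about ethnic "
                       f"culture:\n{demos}\n"
                       "Write one new instruction in the same style.",
        }],
    )
    candidate = resp.choices[0].message.content.strip()
    # Reject candidates too similar to anything already in the pool
    # (difflib stands in for the ROUGE-L filter used by Self-Instruct).
    if max(difflib.SequenceMatcher(None, s, candidate).ratio()
           for s in pool) > 0.7:
        return None
    pool.append(candidate)
    return candidate
```
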
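For the fine-tuning experiments on ChatGLM-6B and LLaMA-2-7B, a common parameter-efficient recipe is LoRA [15], which appears in the paper's references. The sketch below attaches LoRA adapters to LLaMA-2-7B with Hugging Face PEFT; every hyperparameter shown is an illustrative assumption, not a value reported in the paper.

```python
# Illustrative LoRA setup for instruction fine-tuning on NCIFD-style
# data. Rank, alpha, dropout, and target modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # gated model; requires granted access
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)

lora_cfg = LoraConfig(
    r=8,                                  # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # only adapter weights train
```

Training would then proceed with a standard causal-LM trainer over the instruction-output pairs; per the abstract, it is the base checkpoints, not the chat versions, that are fine-tuned in the paper's comparison.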

Keywords

large language models / national culture / instruction fine-tuning / dataset

Cite this article

LUO He, ZHANG Ting, SUN Yuan, PENGMAO Cairang, DAWA Cairen. NCIFD: National Culture Instruction-Following Dataset for Large Language Models. Journal of Chinese Information Processing, 2025, 39(2): 41-51.

References

[1] KHETAN A, KARNIN Z. schuBERT: Optimizing elements of BERT[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 2807-2818.
[2] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of NAACL-HLT, 2019: 4171-4186.
[3] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J].OpenAI Blog, 2019, 1(8): 9-10.
[4] KAPLAN J, MCCANDLISH S, HENIGHAN T, et al. Scaling laws for neural language models[J]. arXiv preprint arXiv:2001.08361, 2020.
[5] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, 33: 1877-1901.
[6] NARANG S, CHOWDHERY A. Pathways language model (PaLM): Scaling to 540 billion parameters for breakthrough performance[J]. arXiv preprint arXiv:2204.02311, 2022.
[7] XIE S M, RAGHUNATHAN A, LIANG P, et al. An explanation of in-context learning as implicit Bayesian inference[C]//Proceedings of ICLR, 2022: 1-31.
[8] OUYANG L, WU J, JIANG X, et al. Training language models to follow instructions with human feedback[C]//Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022, 35: 27730-27744.
[9] ACHIAM J, ADLER S, AGARWAL S, et al. GPT-4 technical report[J]. arXiv preprint arXiv:2303.08774, 2023.
[10] LUO Z, XU C, ZHAO P, et al. WizardCoder: Empowering code large language models with evol-instruct[C]//Proceedings of the 12th International Conference on Learning Representations, 2024: 1-13.
[11] TOUVRON H, MARTIN L, STONE K, et al. LLaMA 2: Open foundation and fine-tuned chat models[J]. arXiv preprint arXiv:2307.09288, 2023.
[12] HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[C]//Proceedings of the International Conference on Machine Learning. PMLR, 2019: 2790-2799.
[13] LI X L, LIANG P. Prefix-tuning: Optimizing continuous prompts for generation[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021: 4582-4597.
[14] LESTER B, AL-RFOU R, CONSTANT N. The power of scale for parameter-efficient prompt tuning[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2021: 3045-3059.
[15] HU E J, WALLIS P, ALLEN-ZHU Z, et al. LoRA: Low-rank adaptation of large language models[C]//Proceedings of the International Conference on Learning Representations, 2021: 1-16.
[16] ZHANG S, ZHANG X, WANG H, et al. Multi-scale attentive interaction networks for Chinese medical question answer selection[J]. IEEE Access, 2018, 6: 74061-74071.
[17] HE J, FU M,TU M. Applying deep matching networks to Chinese medical question answering: A study and a dataset[J]. BMC Medical Informatics and Decision Making, 2019, 19: 91-100.
[18] LI Y, ZHANG Y, ZHAO Z, et al. CSL: A large-scale Chinese scientific literature dataset[C]//Proceedings of the 29th International Conference on Computational Linguistics, 2022: 3917-3923.
[19] GitHub. Chinese alpaca dataset[EB/OL]. https://github.com/hikariming/alpaca_chinese_dataset [2024-05-30].
[20] GitHub. MNBVC: Massive never-ending BT vast Chinese corpus[EB/OL]. https://github.com/esbatmop/MNBVC [2024-05-30].
[21] ZHU Q, HUANG K, ZHANG Z, et al. CrossWOZ: A large-scale Chinese cross-domain task-oriented dialogue dataset[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 281-295.
[22] ZHOU H, ZHENG C, HUANG K, et al. KdConv: A Chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 7098-7108.
[23] WANG Y, KORDI Y, MISHRA S, et al. Self-Instruct: Aligning language models with self-generated instructions[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023: 13484-13508.
[24] PENG B, LI C, HE P, et al. Instruction tuning with GPT-4[J]. arXiv preprint arXiv:2304.03277, 2023.
[25] SUN Z, SHEN Y, ZHOU Q, et al. Principle-driven self-alignment of language models from scratch with minimal human supervision[C]//Proceedings of the 37th International Conference on Neural Information Processing Systems, 2024: 2511-2565.
[26] ZHANG X, YANG Q. Self-QA: Unsupervised knowledge guided language model alignment[J]. arXiv preprint arXiv:2305.11952, 2023.
[27] YUN Feng. The publication of 《中国民族百科全书》 (Encyclopedia of Chinese Ethnic Groups)[J]. Journal of Research on Education for Ethnic Minorities, 2016, 27(1): 144-156.
[28] DA Yu. The publication of 《中国服饰大典》 (Chinese Costume Canon)[J]. Journal of Guangxi University for Nationalities (Philosophy and Social Science Edition), 2000(3): 123-129.
[29] PAN Shan. Research on traditional ethnic minority costume culture: A review of 《中国民族服饰文化研究》[J]. Wool Textile Journal, 2020, 48(1): 97-98.
[30] LIU Yuhong. A distinctive reader on ancient cultural history: A brief review of 《中国古代文化会要》[J]. Journal of Zhejiang University (Humanities and Social Sciences), 2007(5): 86-97.
[31] LIU Xiaorong. Research on the sustainable development of ethnic tourism villages from the perspective of cultural capital[D]. Wuhan: China University of Geosciences, 2013.
[32] HU E J, WALLIS P, ALLEN-ZHU Z, et al. LoRA: Low-rank adaptation of large language models[C]//Proceedings of the International Conference on Learning Representations, 2021: 1-16.
[33] ZENG A, LIU X, DU Z, et al. GLM-130B: An open bilingual pre-trained model[C]//Proceedings of ICLR, 2023: 1-19.
[34] YANG A, XIAO B, WANG B, et al. Baichuan 2: Open large-scale language models[J]. arXiv preprint arXiv:2309.10305, 2023.
[35] BAI J, BAI S, CHU Y, et al. Qwen technical report[J]. arXiv preprint arXiv:2309.16609, 2023.

LUO He (2001—), co-first author, M.S. candidate; research interests include natural language processing, knowledge graphs, and machine reading comprehension. E-mail: 407987622@qq.com
ZHANG Ting (1980—), co-first author, laboratory engineer; research interests include natural language processing, information extraction, and machine reading comprehension. E-mail: tozhangting@126.com
SUN Yuan (1979—), corresponding author, professor; research interests include large language models. E-mail: tracy.yuan.sun@gmail.com

Funding

National Social Science Fund of China (22&ZD035); National Natural Science Foundation of China (61972436); Minzu University of China projects (GRSCP202316, 2023QNYL22, 2024GJYY43)