Abstract

Research on large language models (LLMs) such as InstructGPT currently focuses on free-form generation tasks, while structured extraction tasks remain largely unexplored. To give future work a comprehensive view of LLMs on structured extraction, this paper analyzes InstructGPT's performance on named entity recognition (NER), one of the fundamental structured extraction tasks, under both zero-shot and few-shot settings. To make the findings reliable, the experiments cover flat and nested datasets from both the biomedical and general domains. The results show that InstructGPT's zero-shot performance reaches only 11% to 56% of that of a fine-tuned small-scale language model, and adding a few in-context examples raises it to at most 72%. To explore why InstructGPT struggles with NER, this paper examines the model outputs and finds invalid generations in nearly 50% of the sentences. Moreover, because invalid generation leads to both spuriously wrong and spuriously correct predictions, fixing invalid generation alone does not guarantee better performance. In addition, InstructGPT's ability to extract nested entities remains limited, and the proportion of nested entities it extracts is low. Therefore, beyond ensuring the validity of the generated outputs, applying InstructGPT to NER calls for deeper research into effective methods.
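The abstract describes prompting InstructGPT for NER and counting how many outputs are "invalid" (unparsable, or containing mentions and types that do not match the input). The paper does not publish its exact prompt or parsing code, so the following is only a minimal Python sketch of that workflow under assumed conventions: the prompt wording, the "mention | type" output format, and the `call_instructgpt()` helper are hypothetical stand-ins, not the authors' setup.

```python
# Minimal sketch (assumptions noted above): zero-shot NER prompting plus a
# validity check on the generated output. call_instructgpt() is hypothetical.
import re
from typing import List, Tuple


def build_zero_shot_prompt(sentence: str, entity_types: List[str]) -> str:
    """Ask the model to list entity mentions of the requested types."""
    types = ", ".join(entity_types)
    return (
        f"Extract all named entities of the following types: {types}.\n"
        f"Sentence: {sentence}\n"
        "Answer with one line per entity in the form: mention | type\n"
        "Entities:"
    )


def parse_entities(generation: str) -> List[Tuple[str, str]]:
    """Parse 'mention | type' lines; lines that do not match are skipped."""
    entities = []
    for line in generation.strip().splitlines():
        m = re.match(r"^\s*(.+?)\s*\|\s*(.+?)\s*$", line)
        if m:
            entities.append((m.group(1), m.group(2)))
    return entities


def is_valid_generation(sentence: str, generation: str,
                        entity_types: List[str]) -> bool:
    """Count a generation as valid only if it is parsable, every mention is a
    substring of the source sentence, and every type is one we asked for."""
    entities = parse_entities(generation)
    if generation.strip() and not entities:
        return False  # nothing could be parsed from a non-empty output
    return all(mention in sentence and etype in entity_types
               for mention, etype in entities)


if __name__ == "__main__":
    sentence = "Aspirin reduces the risk of heart attack."
    prompt = build_zero_shot_prompt(sentence, ["Chemical", "Disease"])
    # generation = call_instructgpt(prompt)  # hypothetical API call
    generation = "Aspirin | Chemical\nheart attack | Disease"
    print(parse_entities(generation))
    print(is_valid_generation(sentence, generation, ["Chemical", "Disease"]))
```

A few-shot variant would simply prepend demonstration sentence/answer pairs to the same prompt; invalid mentions that happen to overlap gold entities are what the abstract calls spuriously correct predictions, which is why filtering invalid outputs alone does not guarantee a higher F1.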
Key words
large language model /
named entity recognition /
in-context learning /
chain-of-thought
Funding
National Natural Science Foundation of China (62022027); National Key Research and Development Program of China (2022CSJGG0801)