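A Method for Constructing a Technology and Terminology Corpus for the National Defense Science and Technology Domain

FENG Luanluan, LI Junhui, LI Peifeng, ZHU Qiaoming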

Journal of Chinese Information Processing, 2020, Vol. 34, Issue (8): 41-50.
Language Resource Construction


Constructing a Technology and Terminology Corpus Oriented National Defense Science

  • FENG Luanluan, LI Junhui, LI Peifeng, ZHU Qiaoming

Abstract

The Internet hosts a massive body of literature and science and technology information that contains a large amount of high-value intelligence. Identifying technologies and terminology in the national defense science and technology domain lays the groundwork for building a knowledge graph of that domain. Drawing on the massive military text available in this domain, we collect a corpus seeded by the emerging military technologies listed in Wikipedia, covering three genres: news, papers, and Wikipedia. After analyzing the characteristics of military technical text, we design a set of annotation guidelines, carry out large-scale annotation, and construct a technology and terminology corpus oriented to national defense science (ONDS). The corpus comprises 479 annotated articles containing 24,487 sentences and 33,756 technologies and terms. We also explore the feasibility of a model pre-annotation strategy, and we analyze the distribution of technology and terminology categories across genres as well as the annotation consistency of the corpus. Experiments on the corpus show that technology and terminology recognition achieves an F1 score of 70.40%, providing a basis for further research on technology and terminology recognition.
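The F1 score quoted in the abstract is a span-level measure of recognition quality. This page does not show how the authors scored their system, so the following is only a minimal sketch, assuming exact matching on (start, end, label) spans. The function name `span_f1`, the `Span` representation, and the labels `TECH` and `TERM` are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, label).
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative data only: three gold spans and four predicted spans.
gold = {(0, 2, "TECH"), (5, 7, "TERM"), (9, 12, "TECH")}
pred = {(0, 2, "TECH"), (5, 7, "TECH"), (9, 12, "TECH"), (14, 15, "TERM")}

p, r, f = span_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this example the span (5, 7) is found with the wrong label, so it counts as both a false positive and a false negative. The same computation can also serve as an agreement measure between two annotators by treating one annotator's spans as the gold set.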

Key words

oriented national defense science / technology and terminology / annotation guidelines / corpus

Cite this article

FENG Luanluan, LI Junhui, LI Peifeng, ZHU Qiaoming. Constructing a Technology and Terminology Corpus Oriented National Defense Science. Journal of Chinese Information Processing. 2020, 34(8): 41-50

Funding

National Natural Science Foundation of China (61836007, 61472265, 61876120)