Patent Efficacy Phrase Recognition Based on Multiple Features

LUO Yixiong, LYU Xueqiang, YOU Xindong

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 12: 139-148.
Information Extraction and Text Mining

Abstract

Patent efficacy phrases are key information in patent text, and recognizing them is an important step in building a technology-efficacy map. To address the low accuracy of existing efficacy phrase recognition methods, this paper proposes a recognition method that fuses multiple features. Features are divided by granularity into character-level and word-level features. The character-level features comprise the character itself, its pinyin, and its Wubi code; the word-level feature is the set of words that contain the current character. Character-level features are vectorized with Word2Vec or BERT, and an attention mechanism fuses the vector representations of the words in each set into a word-level feature vector aligned with the input sequence. The feature vectors are concatenated in the embedding layer and fed into a BiLSTM or Transformer encoder, and a CRF then decodes the label sequence for the input sequence. Using patents from the new energy vehicle domain as the corpus, the paper analyzes how different feature combinations and neural network models affect recognition. The experiments show that the best feature combination is the Word2Vec character vector, the BERT character vector, the Wubi feature vector, and the word-level feature vector. With this combination, BiLSTM+CRF reaches an F1 score of 91.15% on patent efficacy phrase recognition, outperforming existing methods.
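The word-level feature and its fusion with the character vector can be sketched as follows. This is a minimal illustration, not the authors' implementation. The toy lexicon, the random embedding tables, the 4-dimensional vectors, the 2-to-4-character word length limit, and the use of scaled dot-product attention with the character vector as the query are all assumptions, since the abstract does not specify the attention form. Pinyin, Wubi, and BERT vectors would be further per-character vectors concatenated in the same way.

```python
import math
import random


def words_containing_each_char(sentence, lexicon, max_len=4):
    """For each character position, collect the lexicon words that occur in
    the sentence and cover that position (assumed word length: 2..max_len)."""
    word_sets = [[] for _ in sentence]
    for start in range(len(sentence)):
        longest_end = min(start + max_len, len(sentence))
        for end in range(start + 2, longest_end + 1):
            word = sentence[start:end]
            if word in lexicon:
                for pos in range(start, end):
                    word_sets[pos].append(word)
    return word_sets


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(query, word_vectors):
    """Fuse a set of word vectors into one vector. Assumed form: scaled
    dot-product attention with the character vector as the query."""
    dim = len(query)
    if not word_vectors:
        return [0.0] * dim  # no matching word: zero word-level feature
    scores = [
        sum(q * w for q, w in zip(query, vec)) / math.sqrt(dim)
        for vec in word_vectors
    ]
    weights = softmax(scores)
    return [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]


def toy_embedding(keys, dim, seed):
    """Hypothetical stand-in for a trained Word2Vec lookup table."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in sorted(keys)}


def embed_sentence(sentence, lexicon, char_emb, word_emb):
    """Concatenate each character vector with its fused word-level vector."""
    rows = []
    word_sets = words_containing_each_char(sentence, lexicon)
    for ch, words in zip(sentence, word_sets):
        char_vec = char_emb[ch]
        word_vec = attention_fuse(char_vec, [word_emb[w] for w in words])
        rows.append(char_vec + word_vec)  # list concatenation
    return rows


DIM = 4
sentence = "提高续航能力"  # illustrative phrase: "improve driving range"
lexicon = {"提高", "续航", "能力", "续航能力"}
char_emb = toy_embedding(set(sentence), DIM, seed=0)
word_emb = toy_embedding(lexicon, DIM, seed=1)

rows = embed_sentence(sentence, lexicon, char_emb, word_emb)
print(len(rows), len(rows[0]))  # sequence length, fused feature width
```

Each row would then be one time step of the input to the BiLSTM or Transformer encoder. Using the character vector as the attention query lets each position weight its candidate words differently, which is one plausible way to resolve overlapping matches such as 续航 and 续航能力.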

Keywords

patent efficacy phrase / multi-feature fusion / neural network / attention mechanism

Cite this article

LUO Yixiong, LYU Xueqiang, YOU Xindong. Patent Efficacy Phrase Recognition Based on Multiple Features. Journal of Chinese Information Processing. 2022, 36(12): 139-148

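Funding

National Natural Science Foundation of China (61671070); Beijing Information Science and Technology University Project for Promoting the Connotative Development of Universities and Improving Research Level (2019KYNH226); Beijing Information Science and Technology University "Qinxin Talent" Cultivation Program (QXTCPB201908); Scientific Research Program of the Beijing Municipal Education Commission (KM202111232001)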