Retrieval of Similar Questions from Chinese Medical Question Answering Websites

WANG Baocheng, LIU Lijun, HUANG Qingsong

Journal of Chinese Information Processing, 2022, Vol. 36, Issue 6: 135-145.
Information Retrieval


Abstract

Medical question-answering platforms mainly serve users through keyword-based retrieval. This approach copes poorly with the diverse expressions and frequent negation words found in medical text, and it cannot fully capture the semantics of a user's query, so the results contain many irrelevant items. To address this, a hash generation model based on an improved text convolutional neural network is first used for the semantic recall of similar questions, which better handles diverse expressions and frequent negation. The recalled set is then filtered and ranked with a more accurate text matching model built through ensemble learning. The ensemble first integrates the Siamese-BERT model, which uses a Siamese network with BERT as the base model to better extract semantics. It then integrates the BERT-Match model, which relies on BERT's multi-head attention mechanism to better capture local correlations between questions. Finally, a gradient boosting decision tree combines the semantic features with statistical features to make the model more accurate. Experiments show that the proposed method achieves better results in both similar-question recall and text matching.
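The two-stage pipeline described in the abstract (hash-based recall followed by re-ranking with a matching model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the hash codes are hard-coded hypothetical 8-bit values rather than outputs of the CNN-based hash generation model, and `char_overlap` is a toy stand-in for the ensemble text matching model. All function and variable names are made up for this example.

```python
from typing import Callable, Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def recall_candidates(query_code: int, index: Dict[str, int], max_distance: int) -> List[str]:
    """Stage 1: keep questions whose code is within max_distance bits of the query code."""
    return [q for q, code in index.items() if hamming_distance(query_code, code) <= max_distance]


def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Stage 2: score each recalled question against the query and sort, best first."""
    scored = [(c, score(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def char_overlap(a: str, b: str) -> float:
    """Toy stand-in for the matching model: Jaccard overlap of character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


# Hypothetical 8-bit codes; in the paper they would come from the hash generation model.
index = {
    "headache after cold": 0b10110010,
    "cold with headache": 0b10110011,
    "knee pain when running": 0b01001100,
}
query, query_code = "headache and cold", 0b10110110

candidates = recall_candidates(query_code, index, max_distance=2)
ranked = rerank(query, candidates, char_overlap)
```

Comparing short binary codes by Hamming distance keeps the recall stage cheap, so the more expensive matching model only has to score the small recalled set.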

Key words

medical question-answering platform / text convolutional neural network / text matching model / ensemble learning

Cite this article

WANG Baocheng, LIU Lijun, HUANG Qingsong. Retrieval of Similar Questions from Chinese Medical Question Answering Websites. Journal of Chinese Information Processing. 2022, 36(6): 135-145

Funding

National Natural Science Foundation of China (81860318, 81560296)