Conflating Papers across Different Data Sources

ZHANG Fanjin, GU Xiaotao, YAO Peiran, TANG Jie

Journal of Chinese Information Processing, 2018, Vol. 32, Issue 9: 84-92, 131.
Information Extraction and Text Mining


Abstract

This paper studies the problem of conflating papers across data sources, that is, matching records in different sources that refer to the same paper. We propose two matching algorithms. The first, MHash, uses hashing to accelerate matching. The second, MCNN, uses a convolutional neural network (CNN) to improve matching accuracy. Experiments show that, by combining multiple paper attributes, MHash produces matches quickly while maintaining high accuracy (93%+), and MCNN reaches very high accuracy (98%+). We also design an asynchronous search framework for large-scale paper matching, with which we obtained 64,639,608 matched pairs of AMiner and MAG papers within 15 days. The matching results, together with the full AMiner and MAG paper data, have been released as a public dataset.
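The abstract does not describe how MHash works internally, so the sketch below only illustrates the general idea of hash-accelerated matching and should not be read as the authors' algorithm. It assumes a SimHash-style signature over title tokens and a banded index for candidate lookup. The function names (`simhash`, `bands`, `match`), the 64-bit signature, the four bands, and the `max_dist=3` threshold are all hypothetical choices made for illustration.

```python
import hashlib
from collections import defaultdict

BITS = 64  # signature length (an assumption, not taken from the paper)


def tokens(title):
    # Lowercase the title and keep alphanumeric word tokens only.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return cleaned.split()


def simhash(title):
    # SimHash-style signature: every token votes +1 or -1 on each bit.
    votes = [0] * BITS
    for tok in tokens(title):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a, b):
    # Number of differing bits between two signatures.
    return bin(a ^ b).count("1")


def bands(sig, n_bands=4):
    # Split the signature into equal-width bands used as index keys.
    # Two signatures within Hamming distance 3 share at least one of 4 bands.
    width = BITS // n_bands
    mask = (1 << width) - 1
    return [(i, (sig >> (i * width)) & mask) for i in range(n_bands)]


def match(papers_a, papers_b, max_dist=3):
    # papers_a and papers_b map a paper id to its title.
    sigs_b = {pid: simhash(title) for pid, title in papers_b.items()}
    index = defaultdict(list)
    for pid, sig in sigs_b.items():
        for key in bands(sig):
            index[key].append(pid)

    pairs = []
    for pid_a, title in papers_a.items():
        sig_a = simhash(title)
        # Only papers sharing a band are compared, instead of all of papers_b.
        candidates = {pid for key in bands(sig_a) for pid in index.get(key, ())}
        if not candidates:
            continue
        best = min(candidates, key=lambda p: hamming(sig_a, sigs_b[p]))
        if hamming(sig_a, sigs_b[best]) <= max_dist:
            pairs.append((pid_a, best))
    return pairs
```

In this sketch, titles that differ only in case or punctuation receive identical signatures and therefore always land in the same buckets. Combining further attributes such as authors, venue, and year, which the abstract says MHash does, is left out here.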

Key words

data integration / convolutional neural network / learning to hash / web crawler

Cite this article

ZHANG Fanjin, GU Xiaotao, YAO Peiran, TANG Jie. Conflating Papers across Different Data Sources. Journal of Chinese Information Processing. 2018, 32(9): 84-92, 131
