中文信息学报 (Journal of Chinese Information Processing), 2018, Vol. 32, Issue (9): 84-92, 131.


  • 张帆进,顾晓韬,姚沛然,唐杰
Conflating Papers across Different Data Sources

  • ZHANG Fanjin, GU Xiaotao, YAO Peiran, TANG Jie
This paper studies the problem of conflating papers across data sources, that is, matching records that refer to the same paper in different sources. We propose two algorithms for paper matching. The first, MHash, uses hashing to accelerate matching. The second, MCNN, uses a convolutional neural network (CNN) to improve matching accuracy. Experiments show that, by combining multiple paper attributes, MHash produces matches quickly while maintaining high accuracy (93%+), and MCNN achieves very high accuracy (98%+). We also design an asynchronous search framework for large-scale paper matching, with which we obtained 64,639,608 matched pairs of AMiner and MAG papers within 15 days. The matching results, together with the full AMiner and MAG paper data, have been released as a public dataset.
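To make the idea of hash-accelerated matching concrete, here is a minimal sketch in Python. It is not the paper's MHash algorithm, whose details the abstract does not give. It swaps in plain SimHash over title words, and the function names, the 64-bit fingerprint size, and the `max_distance` threshold are all assumptions chosen for illustration.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed; not taken from the paper)


def tokens(title):
    """Lowercase the title and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def simhash(title):
    """Return a 64-bit SimHash fingerprint of the title's word tokens."""
    weights = [0] * BITS
    for tok in tokens(title):
        # Hash each token to 64 bits using the first 8 bytes of its MD5 digest.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when more tokens voted 1 than 0 at that position.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_candidate_match(title_a, title_b, max_distance=3):
    """Treat two titles as a candidate pair if their fingerprints are close.

    max_distance is a hypothetical threshold, not a value from the paper.
    """
    return hamming(simhash(title_a), simhash(title_b)) <= max_distance
```

Comparing fixed-width fingerprints costs a single XOR and a bit count per pair, which is why hashing can speed up matching over large collections. A real system would presumably combine several attributes (for example title, authors, venue, and year), as the abstract indicates, instead of relying on titles alone.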


Key words

data integration / convolutional neural network / learning to hash / web crawler


