Robust and Distributed Web-Scale Near-Dup Document Conflation in Microsoft Academic Service

  • Chieh-Han Wu,
  • Yang Song

IEEE International Conference on Big Data - Workshop on Data Quality Issues

Published by IEEE - Institute of Electrical and Electronics Engineers

In modern web-scale applications that collect data from different sources, entity conflation is a challenging task due to various data quality issues. In this paper, we propose a robust and distributed framework to perform conflation on noisy data in the Microsoft Academic Service dataset. Our framework contains two major components. In the offline component, we train a gradient boosted decision tree (GBDT) model to determine whether two papers from different sources should be conflated into the same paper entity. In the online component, we propose a scalable shingling algorithm that can apply our offline model to over 100 million instances. The results show that our algorithm can conflate noisy data robustly and efficiently.
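The abstract does not spell out the shingling algorithm, so the following is only a minimal sketch of how shingling is commonly used as a candidate-generation step ahead of a pairwise classifier. It assumes word-level shingles of length k=3 over paper titles and a simple in-memory inverted index; the function names (`shingles`, `jaccard`, `candidate_pairs`) and the sample records are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < k:
        # Too short for a full shingle: fall back to the whole string.
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def candidate_pairs(records, k=3):
    """Return id pairs that share at least one shingle.

    `records` maps a record id to its text. Only these pairs would be
    passed on to a pairwise model, instead of all O(n^2) pairs.
    """
    index = defaultdict(set)  # shingle -> ids of records containing it
    for rid, text in records.items():
        for s in shingles(text, k):
            index[s].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample records, for illustration only.
    records = {
        "a": "Robust and distributed web-scale near-dup document conflation",
        "b": "robust and distributed web-scale near-dup document conflation in MAS",
        "c": "A completely unrelated title about graph databases",
    }
    for x, y in sorted(candidate_pairs(records)):
        sim = jaccard(shingles(records[x]), shingles(records[y]))
        print(x, y, round(sim, 3))
```

In a setup like the one the abstract describes, each surviving pair would then be scored by the trained GBDT model. In a distributed setting the inverted index would typically become a group-by on the shingle key, though how the paper partitions the work is not stated here.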