A Human-and-Machine Cooperative Framework for Entity Resolution with Quality Guarantees

Introduction

Entity resolution (ER) refers to the task of identifying the relational records that correspond to the same real-world entity. It has been extensively studied in the literature [1]. However, most existing approaches do not provide a mechanism for quality control. Although there is work [2] based on active learning that can optimize recall while guaranteeing a pre-specified precision level, in practice it is usually desirable that the results carry quality guarantees on both precision and recall.
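For reference, precision and recall of an ER result are the standard quantities below, where TP, FP, and FN (notation introduced here only for illustration) denote the numbers of true positives, false positives, and false negatives among the pair instances labeled as matched:

\[
  \text{precision} = \frac{TP}{TP + FP},
  \qquad
  \text{recall} = \frac{TP}{TP + FN}
\]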

To this end, we propose a HUman-and-Machine cOoperative framework (HUMO) with a flexible mechanism for quality control. Its primary idea is to divide the pair instances in an ER task into easy ones, which can be labeled by the machine with high accuracy, and more challenging ones, which require human intervention. HUMO is, to some extent, motivated by the success of human-machine cooperation in problem solving, as demonstrated by crowdsourcing applications. We note that existing work on crowdsourcing for ER [3] has mainly focused on how to make human workers perform a given task effectively and efficiently. HUMO instead investigates how to divide the workload of a task between human and machine such that a quality requirement can be met. Since the workload assigned to humans can usually be performed by crowdsourcing, HUMO can be considered a preprocessing step before a crowdsourcing task is invoked. In this demo, we make the following contributions:

Framework

Based on the monotonicity assumption that the more similar two records are, the more likely they refer to the same real-world entity, HUMO divides the pair instances of a dataset into three disjoint subsets, as shown in Figure 1: the low-similarity instances, which HUMO automatically labels as unmatched; the high-similarity instances, which HUMO automatically labels as matched; and the remaining medium-similarity instances, which are assigned for human verification.

Figure 1: The HUMO Framework.

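To illustrate the partition concretely, the following is a minimal Python sketch under the assumption that each pair instance comes with a precomputed similarity score and that two boundary values, hypothetically named lower_bound and upper_bound, have already been chosen; HUMO itself determines these boundaries so that the specified precision and recall requirements can be met.

# A minimal sketch of HUMO's three-way partition of pair instances.
# The similarity scores and the two boundaries (lower_bound, upper_bound)
# are illustrative assumptions; HUMO itself chooses the boundaries so that
# the user-specified precision and recall requirements can be met.
def partition(pairs, lower_bound, upper_bound):
    """Split (pair, similarity) tuples into unmatched, human, and matched sets."""
    unmatched, human, matched = [], [], []
    for pair, similarity in pairs:
        if similarity < lower_bound:
            unmatched.append(pair)   # automatically labeled "unmatched"
        elif similarity > upper_bound:
            matched.append(pair)     # automatically labeled "matched"
        else:
            human.append(pair)       # assigned for human verification
    return unmatched, human, matched

# Toy usage with similarity scores in [0, 1]:
pairs = [(("r1", "r2"), 0.15), (("r1", "r3"), 0.55), (("r4", "r5"), 0.92)]
unmatched, human, matched = partition(pairs, lower_bound=0.3, upper_bound=0.8)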

References

[1] P. Christen, Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer Science & Business Media, 2012.

[2] A. Arasu, M. Götz, and R. Kaushik, “On active learning of record matching packages,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010, pp. 783–794.

[3] J. Wang, T. Kraska, M. J. Franklin, and J. Feng, “CrowdER: crowdsourcing entity resolution,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1483–1494, 2012.