# A Human-and-Machine Cooperative Framework for Entity Resolution with Quality Guarantees

## Introduction

Entity resolution (ER) usually refers to identifying the relational records that correspond to the same real-world entity. It has been studied extensively in the literature[1]. However, most existing approaches lack a mechanism for quality control. Although some work[2], based on active learning, can optimize recall while guaranteeing a pre-specified precision level, in practice it is usually desirable for the results to carry quality guarantees on both precision and recall.

To this end, we propose a HUman-and-Machine cOoperative framework (HUMO) with a flexible mechanism for quality control. Its primary idea is to divide the pair instances in an ER task into easy ones, which can be labeled by machine with high accuracy, and more challenging ones, which require human intervention. HUMO is, to some extent, motivated by the success of human-and-machine cooperation in problem solving, as demonstrated by crowdsourcing applications. We note that crowdsourcing for ER[3] has mainly focused on how to make humans work effectively and efficiently on a given task. HUMO instead investigates how to divide the workload of a task between human and machine such that a quality requirement can be met. Since the workload assigned to humans can usually be performed by crowdsourcing, HUMO can be regarded as a preprocessor that runs before a crowdsourcing task is invoked. In this demo, we make the following contributions:

• We propose a human-and-machine cooperative framework (HUMO) for entity resolution that can enforce quality control on both precision and recall fronts;
• We introduce the problem of minimizing human cost given a quality requirement in HUMO and propose corresponding optimization techniques;
• We demonstrate that HUMO achieves high-quality results with a reasonable return on investment (ROI) in terms of human cost on real datasets.

## Framework

HUMO rests on the monotonicity assumption: the more similar two records are, the more likely they are to refer to the same real-world entity. Under this assumption, it divides a dataset $D$ of pair instances into three disjoint subsets ordered by increasing similarity, $D_1$, $D_2$ and $D_3$, as shown in Figure 1. HUMO automatically labels the instances in $D_1$ as unmatched and the instances in $D_3$ as matched, and assigns the instances in $D_2$ to humans for verification.
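The three-way split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that each pair already carries a similarity score in $[0, 1]$ and that the subset boundaries are given as two thresholds. The names `partition_pairs`, `resolve`, `lower`, `upper` and the `ask_human` callback are hypothetical and introduced only for illustration; how HUMO actually chooses the boundaries to meet a quality requirement is not covered here.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # a pair of record identifiers (assumed representation)


def partition_pairs(
    scored_pairs: Dict[Pair, float], lower: float, upper: float
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split pairs into (d1, d2, d3) by similarity score.

    d1: score < lower            -> machine-labeled as unmatched
    d2: lower <= score <= upper  -> sent for human verification
    d3: score > upper            -> machine-labeled as matched
    The thresholds are assumed inputs, not values prescribed by the paper.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    d1: List[Pair] = []
    d2: List[Pair] = []
    d3: List[Pair] = []
    for pair, score in scored_pairs.items():
        if score < lower:
            d1.append(pair)
        elif score > upper:
            d3.append(pair)
        else:
            d2.append(pair)
    return d1, d2, d3


def resolve(
    scored_pairs: Dict[Pair, float],
    lower: float,
    upper: float,
    ask_human: Callable[[Pair], bool],
) -> Dict[Pair, bool]:
    """Label every pair: machine labels for d1 and d3, human labels for d2."""
    d1, d2, d3 = partition_pairs(scored_pairs, lower, upper)
    labels: Dict[Pair, bool] = {pair: False for pair in d1}
    labels.update({pair: True for pair in d3})
    # Only the pairs in d2 incur human cost.
    labels.update({pair: ask_human(pair) for pair in d2})
    return labels


if __name__ == "__main__":
    scores = {("r1", "r2"): 0.95, ("r1", "r3"): 0.10, ("r2", "r4"): 0.55}
    # Stand-in for a human or crowdsourced answer on the uncertain pair.
    human_answers = {("r2", "r4"): True}
    print(resolve(scores, 0.3, 0.8, lambda pair: human_answers[pair]))
```

In this sketch, widening the interval between `lower` and `upper` sends more pairs to human verification, which raises human cost, while narrowing it leaves more pairs to the machine labels.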

Figure 1: The HUMO Framework.