Database Research Group PhD Thesis Defence

2012 Mar 07 at 13:00

DC 2310

Modeling and Querying Uncertainty in Data Cleaning

George Beskales, PhD candidate, David R. Cheriton School of Comp. Sci., Univ. Waterloo

Data quality problems such as duplicate records, missing values, and violation of integrity constrains are frequently encountered in real world applications. Such problems cost enterprises billions of dollars annually, and might have unpredictable consequences in mission critical tasks. The process of data cleaning refers to detecting and correcting errors in data in order to improve data quality. Numerous efforts have been done towards improving the effectiveness and the efficiency of data cleaning.

A major challenge in the data cleaning process is the presence of conflicting soft evidences that suggest different actions to clean the database. For example, in the context of duplicate elimination, a moderate similarity between two tuples bears two contradicting evidences suggesting that the tuples could be either duplicates or non-duplicates. Existing data cleaning systems resolve conflicting evidences by eliminating (i.e., \emph{destroying}) a subset of evidences based on several heuristics such that the remaining evidences are consistent. Furthermore, because of the complex dependencies among evidences, it is difficult to reverse the process of eliminating some evidences (e.g., when new external information becomes available).

In this dissertation, we propose a non-destructive data cleaning approach that aims at preserving the existing evidences. The main insight is to view conflicts in evidences as uncertainty in deciding which actions should be taken to clean data errors. Thus, the data cleaning process becomes a random process whose possible outcomes are possible clean instances (i.e., repairs). Our approach generates multiple possible clean instances to avoid the destructive aspect of current cleaning systems. In this dissertation, we apply this approach in the context of two prominent data cleaning problems: duplicate elimination, and repairing violations of functional dependencies (FDs). Furthermore, we show how to enable non-destructive data cleaning in situations where both the data and the FDs are untrusted.