This document describes a method for enriching data imputation under similarity rule constraints. It proposes using similarity rules that tolerate small variations, instead of strict equality constraints, to rule out invalid candidates suggested by similar neighbors. The goal is to maximize the number of missing values that can be imputed. The work analyzes the NP-hardness of solving and approximating this problem and presents exact and approximation algorithms. Experiments on real and synthetic datasets demonstrate improved filling accuracy, and record matching also improves after the proposed imputation is applied.
Enriching data imputation under similarity rule constraints
1. 2020 2021
#13/ 19, 1st Floor, Municipal Colony, Kangayanellore Road, Gandhi Nagar, Vellore 6.
Off: 0416-2247353 Mo: +91 9500218218 / +91 8220150373
Website: www.shakastech.com, Email - id: shakastech@gmail.com, info@shakastech.com
Enriching Data Imputation under Similarity Rule Constraints
Abstract
Incomplete information often arises in many database applications, e.g., in
data integration, data cleaning, or data exchange. Data imputation typically fills a
missing value with the values of neighbors that share the same or similar
information. Such neighbors can be identified either with certainty by editing rules or
more broadly by similarity relationships. Owing to data sparsity, the number of neighbors
identified by editing rules w.r.t. value equality is rather limited, especially when the
data values vary slightly. To enrich the imputation candidates, a natural idea is to
also consider the neighbors identified by similarity relationships. However, the candidates
suggested by these (heterogeneous) similarity neighbors may conflict with each other. In
this paper, we propose using similarity rules that tolerate small variations
(instead of the aforesaid editing rules with strict equality constraints) to rule out the
invalid candidates provided by similarity neighbors. To enrich the data imputation, i.e.,
to impute more of the missing values, we study the problem of maximizing the number of
imputed missing values. Our major contributions include (1) an NP-hardness analysis of solving as
well as approximating the problem, (2) exact algorithms for tackling the problem, and (3)
efficient approximation with performance guarantees. Experiments on real and synthetic
data sets demonstrate the superiority of our proposal in filling accuracy. We also
demonstrate that record matching is indeed improved after applying the
proposed imputation.
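
The core idea of the abstract, taking candidate values from similar neighbors and discarding those that would break a similarity rule, can be illustrated with a small sketch. This is not the paper's exact or approximation algorithm: it is a simple greedy pass over a toy table, and the attribute names (`area`, `price`), the tolerances, and the helper names (`violates`, `valid_candidates`, `impute_greedy`) are all assumptions made for illustration. The rule assumed here reads "if two records differ by at most `lhs_tol` on `area`, they must differ by at most `rhs_tol` on `price`".

```python
from typing import Optional

Record = dict[str, Optional[float]]

def violates(a: Record, b: Record, lhs: str, rhs: str,
             lhs_tol: float, rhs_tol: float) -> bool:
    """True if the pair breaks the rule: close on lhs implies close on rhs."""
    if None in (a[lhs], b[lhs], a[rhs], b[rhs]):
        return False  # a missing value cannot witness a violation
    return abs(a[lhs] - b[lhs]) <= lhs_tol and abs(a[rhs] - b[rhs]) > rhs_tol

def valid_candidates(records: list[Record], i: int, lhs: str, rhs: str,
                     lhs_tol: float, rhs_tol: float,
                     neighbor_tol: float) -> list[float]:
    """Candidate rhs values for records[i], nearest neighbor first,
    keeping only those that violate the rule against no other record."""
    rec = records[i]
    others = [o for j, o in enumerate(records) if j != i]
    neighbors = sorted(
        (o for o in others
         if o[lhs] is not None and o[rhs] is not None
         and abs(o[lhs] - rec[lhs]) <= neighbor_tol),
        key=lambda o: abs(o[lhs] - rec[lhs]),
    )
    valid = []
    for n in neighbors:
        trial = {**rec, rhs: n[rhs]}  # tentatively fill the missing cell
        if not any(violates(trial, o, lhs, rhs, lhs_tol, rhs_tol) for o in others):
            valid.append(n[rhs])
    return valid

def impute_greedy(records: list[Record], lhs: str, rhs: str,
                  lhs_tol: float, rhs_tol: float,
                  neighbor_tol: float) -> list[Record]:
    """Greedy illustration only: fill each missing rhs with its first valid candidate."""
    filled = [dict(r) for r in records]  # work on a copy
    for i, rec in enumerate(filled):
        if rec[rhs] is None and rec[lhs] is not None:
            candidates = valid_candidates(filled, i, lhs, rhs,
                                          lhs_tol, rhs_tol, neighbor_tol)
            if candidates:
                rec[rhs] = candidates[0]
    return filled

# Hypothetical toy data: record 3 has three similarity neighbors, record 4 has none.
records = [
    {"area": 110, "price": 260},
    {"area": 100, "price": 200},
    {"area": 104, "price": 210},
    {"area": 103, "price": None},
    {"area": 300, "price": None},
]
result = impute_greedy(records, "area", "price",
                       lhs_tol=5, rhs_tol=20, neighbor_tol=10)
```

In this toy table the neighbor with price 260 is similar enough to be a candidate source for record 3, but filling in 260 would conflict with the record whose area is 100, so the rule discards it. A greedy pass like this does not guarantee that the number of filled cells is maximal, which is the optimization problem the abstract shows to be NP-hard.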