3. Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation
4. Data Quality: Why Preprocess the Data?
Measures of data quality: a multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: values not recorded or unavailable
Consistency: some records modified but others not; dangling references
Timeliness: is the data updated in a timely manner?
Believability: how much are the data trusted to be correct?
Interpretability: how easily can the data be understood?
5. Major Tasks in Data Preprocessing
Data cleaning
Data integration
Data reduction
Data transformation
6. Data Cleaning
Data in the real world is dirty: lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation= (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=10 (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=42, Birthday=03/07/2010
Was rating 1, 2, 3, now rating A, B, C
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone's birthday?
7. How to Handle Missing Data?
Fill in the missing value manually: tedious and often infeasible
Fill it in automatically with:
a global constant, e.g., "unknown" (which may itself act as a new class!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
Manual filling suits small data sets; automatic filling is more efficient for large ones
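The two mean-based fill strategies above can be sketched in a few lines (a minimal illustration; the function names are our own, and `None` stands for a missing value):

```python
from statistics import mean

def fill_with_mean(values):
    """Replace each None with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Smarter: replace each None with the mean of observed values
    from samples belonging to the same class."""
    out = []
    for v, lab in zip(values, labels):
        if v is None:
            v = mean(x for x, l in zip(values, labels)
                     if x is not None and l == lab)
        out.append(v)
    return out
```

For example, `fill_with_class_mean([10, None, 20, None], ['a', 'a', 'b', 'b'])` fills each gap from its own class rather than from the overall mean of 15.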
8. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming conventions
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
9. How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
similar values are grouped into clusters; values falling outside
the clusters can be detected and removed as outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal
with possible outliers)
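The "combined computer and human inspection" step can be approximated with a simple detector that flags suspicious values for a person to review. This sketch uses the common 1.5×IQR fence, which is our own choice of rule, not one prescribed by the slides:

```python
from statistics import quantiles

def suspicious_values(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] for human review."""
    q1, _, q3 = quantiles(data, n=4)   # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in data if v < lo or v > hi]
```

The computer only *detects* candidates; whether a flagged value (e.g., a price of 200 among values in the 4-34 range) is an error or a genuine extreme is left to the human inspector.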
10. Data Integration
Data integration:
Combines data from multiple heterogeneous sources into a
coherent store
Two types:
Tight coupling
data is physically combined and stored in a single location
Loose coupling
only an interface is created; data is combined and accessed through
the interface, while the data remains in the actual source databases
12. Data Reduction
Dimensionality reduction
Reduces the number of input variables in the data set, since too
many input variables can lead to poor performance
Data cube aggregation
Data is combined to construct a data cube
Attribute subset selection
only highly relevant attributes are retained and the others are
discarded or removed, which reduces the data
Numerosity Reduction:
Here, only a model of the data is stored instead of the entire data set
Parametric
Non-parametric: Histogram, Cluster, Sampling
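Two of the non-parametric numerosity-reduction techniques above, sampling and histograms, can be sketched as follows (illustrative code with our own function names; the equal-width binning choice is an assumption, since the slides do not fix a histogram type):

```python
import random
from collections import Counter

def sample_without_replacement(data, n, seed=42):
    """Numerosity reduction by simple random sampling:
    keep only n of the original tuples."""
    return random.Random(seed).sample(data, n)

def equal_width_histogram(values, n_bins):
    """Numerosity reduction by histogram: store per-bin counts
    instead of the raw values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    return {(lo + i * width, lo + (i + 1) * width): counts.get(i, 0)
            for i in range(n_bins)}
```

Either way, the reduced representation (a sample, or a handful of bin counts) stands in for the full data during mining.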
16. Data Transformation
Data is transformed into a form suitable for the mining process
There are four methods of data transformation:
1. Normalization
2. Attribute selection
3. Discretization
4. Concept hierarchy generation
1. Normalization
Normalization is done in order to scale the data values in a
specified range
For example, -1.0 to + 1.0 or 0 to 1
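Min-max normalization, the rescaling described above, is a single linear map; a minimal sketch (the function name is our own):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max],
    e.g., 0 to 1 (default) or -1.0 to +1.0."""
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]
```

For instance, `min_max_normalize([10, 20, 30], -1.0, 1.0)` maps the smallest value to -1.0, the midpoint to 0.0, and the largest to +1.0.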
17. Data Transformation
2. Attribute selection
New attributes are constructed from the existing ones
3. Discretization
Raw values are replaced by interval values
4. Concept hierarchy generation
Attributes are generalized from a lower conceptual level to a higher one
Example: city to country
18. Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set
The attribute with the most distinct values is placed at
the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
Example (auto-generated hierarchy; the lowest level has the most distinct values):
street (674,339 distinct values) < city (3,567) < province_or_state (365) < country (15)
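The distinct-value heuristic above is easy to automate: count distinct values per attribute and sort. A minimal sketch (our own function name; as the slide notes, exceptions such as weekday/month/quarter/year still need manual handling):

```python
def hierarchy_by_distinct_values(columns):
    """Order attributes from the highest hierarchy level (fewest
    distinct values) down to the lowest (most distinct values).
    `columns` maps attribute name -> list of its values."""
    return sorted(columns, key=lambda a: len(set(columns[a])))
```

Applied to street/city/country columns, `country` (fewest distinct values) comes out on top and `street` at the bottom, matching the hierarchy shown above.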
20. Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
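The binning computations above can be reproduced with a short sketch (function names are our own; it assumes the data is already sorted and divides evenly into the bins, as in the example):

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Partition sorted values into equal-frequency (equi-depth) bins."""
    depth = len(sorted_values) // n_bins
    return [sorted_values[i * depth:(i + 1) * depth] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of its bin's min or max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]
```

Running it on the price data reproduces the three bins and both smoothed versions shown above.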
21. Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure: the same data discretized three ways: equal interval width (binning), equal frequency (binning), and K-means clustering; K-means clustering leads to better results]
22. Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g., missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation
Normalization
Concept hierarchy generation