Machine Learning
Chapter Two:
Data Preprocessing
1. Overview of data preprocessing
 Machine learning requires collecting a large amount of data to
achieve the intended objective.
 Real-world data often comes in an unusable format that cannot
be fed directly to machine learning models.
 Before feeding data to an ML model, we have to ensure its quality.
 Data preprocessing is the process of preparing raw data and
making it suitable for a machine learning model.
 It is a crucial step in creating a machine learning model.
 It increases the accuracy and efficiency of a machine learning
model.
Data Quality
 Well-accepted multidimensional data quality
measures include the following:
 Accuracy (free from errors and outliers)
 Completeness (no missing attributes and values)
 Consistency (no inconsistent values and attributes)
 Timeliness (appropriateness of the data for the purpose it is
required)
 Believability (acceptability)
 Interpretability (easy to understand)
Why Data Preprocessing?
 Most real-world data is of poor quality
(incomplete, inconsistent, noisy, invalid, redundant, ...)
 incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
 e.g., occupation = "" (left blank)
 noisy: containing errors or outliers
 e.g., Salary = -10
 inconsistent: containing discrepancies in codes or names
 e.g., Age=42 vs. Birthday=03/07/1997
 e.g., rating was 1, 2, 3; now it is A, B, C
 redundant: including everything, some of which is
irrelevant to our task.
No quality data, no quality results!
Data is often of low quality
 Collecting the required data is challenging
 Why?
 You didn't collect it yourself.
 It probably was created for some other use, and then you came
along wanting to integrate it.
 People make mistakes (typos)
 Data collection instruments used may be faulty.
 Everyone had their own way of structuring and formatting data,
based on what was convenient for them.
 Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit personal
information.
2. Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
 Integration of data from multiple data sources
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Data discretization (for numerical data) and Concept hierarchy generation
(Figure: forms of data preprocessing)
2.1. Data Cleaning
 Data cleaning attempts to:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Incomplete (Missing) Data:
 Data is not always available
 many tuples have no recorded value for several attributes,
such as customer income in sales data.
 Missing data may be due to
 equipment malfunction
 data being inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not being considered important at the time of
entry
 history or changes of the data not being registered.
How to Handle Missing Value?
 Ignore the tuple:
 usually done when the class label is missing (when doing
classification).
 Not effective unless the tuple is missing values for several attributes.
 Fill in the missing value manually: tedious and often infeasible.
 Fill it in automatically with:
 a global constant, e.g., "unknown" (effectively a new class!)
 a measure of central tendency for the attribute (e.g., the
mean or median)
 e.g., if the average customer income is $28,000, use this
value as the replacement.
 the most probable value:
 determined with regression, inference-based methods such as a
Bayesian formula, or a decision tree (the most popular approach).
How to Handle Missing Data?

Age | Income | Religion  | Gender
----|--------|-----------|-------
23  | 24,200 | Muslim    | M
39  | ?      | Christian | F
45  | 45,390 | ?         | F

Fill missing values using aggregate functions (e.g., the average) or
probabilistic estimates over the global value distribution, as in the
sketch below:
E.g., put the average income here, or the most probable income given
that the person is 39 years old.
E.g., put the most frequent religion here.
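
A minimal sketch of these fill-in strategies using pandas; the column
names and values mirror the table above, and the choice of mean for the
numeric attribute and mode for the categorical one is one reasonable
option among those listed, not the only one.

```python
# Minimal missing-value imputation sketch (values from the example table).
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 39, 45],
    "Income": [24200, None, 45390],
    "Religion": ["Muslim", "Christian", None],
    "Gender": ["M", "F", "F"],
})

# Numeric attribute: fill with a measure of central tendency (the mean).
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Categorical attribute: fill with the most frequent value (the mode).
df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

print(df)
```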
Noisy Data
 Noise is a random error or variance in a measured
variable.
 Incorrect attribute values may be due to
 faulty data collection instruments (e.g., OCR)
 data entry problems, e.g., "green" typed as "rgeen"
 data transmission problems
 technology limitations
 inconsistency in naming conventions
How to Handle Noisy Data?
Manually check all data: tedious and often infeasible.
Sort data by frequency:
"green" is more frequent than "rgeen"
Works well for categorical data.
 Use numerical constraints to catch corrupt data (see the sketch below):
 Weight can't be negative
 People can't have more than 2 parents
 Salary can't be less than Birr 300
Check for outliers (the case of the 8-meter man).
Check for correlated outliers using n-grams ("pregnant
male"):
People can be male
People can be pregnant
People can't be male AND pregnant
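
A small sketch of constraint-based checking; the DataFrame contents and
column names are illustrative assumptions, while the thresholds come from
the constraints above.

```python
# Flag records that violate simple numerical constraints.
import pandas as pd

df = pd.DataFrame({
    "weight_kg": [70, -5, 82],
    "salary_birr": [5000, 120, 9000],
})

# Weight can't be negative; salary can't be less than Birr 300.
corrupt = df[(df["weight_kg"] < 0) | (df["salary_birr"] < 300)]
print(corrupt)  # rows flagged for correction or removal
```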
2.2. Data Integration
 Data integration combines data from multiple sources
into a coherent store
 Because different sources are used, data that is
fine on its own may become problematic when we want
to integrate it.
 Some of the issues are:
Different formats and structures
Conflicting and redundant data
Data at different levels
Data Integration: Formats
 Not everyone uses the same format. Do you agree?
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Dates are especially problematic:
 12/19/97
 19/12/97
 19/12/1997
 19-12-97
 Dec 19, 1997
 19 December 1997
 19th Dec. 1997
 Money suffers the same problem:
 Birr 200, Br. 200, 200 Birr, ...
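
The date problem lends itself to a small illustration; this sketch
normalizes several of the listed formats with Python's standard library.
The format list, and its order (which resolves the day/month ambiguity),
is an assumption about the sources being integrated.

```python
# Coerce heterogeneous date strings into one canonical date.
from datetime import datetime

raw_dates = ["12/19/97", "19/12/1997", "19-12-97", "Dec 19, 1997", "19 December 1997"]
formats = ["%m/%d/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def parse_date(s):
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

print([parse_date(d) for d in raw_dates])  # every entry becomes 1997-12-19
```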
Data Integration: Inconsistent
Inconsistent data contains discrepancies in codes or
names, often due to a lack of standardization or naming
conventions, e.g.,
Age=26 vs. Birthday=03/07/1986
Some use 1, 2, 3 for ratings; others use A, B, C
Data Integration: Conflicting Data
Detecting and resolving data value conflicts:
For the same real-world entity, attribute values from different
sources differ.
Possible reasons: different representations, different scales, e.g.,
American vs. British units
 weight: kilograms vs. pounds
 height: meters vs. inches
2.3.Data Reduction Strategies
Data reduction: obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results.
Data reduction strategies:
Dimensionality reduction
 Select the best attributes or remove unimportant attributes
Numerosity reduction
 Reduce data volume by choosing alternative, smaller forms of
data representation
Data compression
Data Reduction: Dimensionality Reduction
 Dimensionality reduction
Helps eliminate irrelevant attributes, which contain no
information useful for model development, and reduces noise.
E.g., is a student's ID relevant for predicting their GPA?
Helps avoid redundant attributes, which duplicate
information held in one or more other attributes.
E.g., the purchase price of a product & the amount of sales tax paid
Reduces the time and space required for model development.
Allows easier visualization.
 Method: attribute subset selection
One way to reduce the dimensionality of data is to select the
best attributes, as in the sketch below.
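
A hedged sketch of attribute subset selection with scikit-learn; the
synthetic data, k = 5, and the ANOVA F-score are placeholder choices, not
a prescription.

```python
# Keep only the k highest-scoring attributes.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 5): 20 attributes reduced to the best 5
```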
Data Reduction: Numerosity Reduction
 Different methods can be used, including Clustering and
sampling
 Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 There are many choices of clustering definitions and clustering
algorithms
 Sampling
 obtain a small sample s to represent the whole data set N
 Key principle: choose a representative subset of the data using a
suitable sampling technique, as sketched below
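
A minimal sampling sketch with pandas; the data and the 10% fraction are
arbitrary assumptions.

```python
# Numerosity reduction by simple random sampling without replacement.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=10_000)})

sample = df.sample(frac=0.10, random_state=0)
print(len(sample))  # 1,000 rows stand in for the full data set
```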
2.4. Data Transformation
 A function that maps the entire set of values of a given
attribute to a new set of replacement values, such that
each old value can be identified with one of the new
values.
 Methods for data transformation
 Normalization: Scaled to fall within a smaller, specified range of
values
 min-max normalization
 z-score normalization
 decimal scaling
 Discretization: Reduce data size by dividing the range of a
continuous attribute into intervals.
 Discretization can be performed recursively on an attribute
using methods such as
 Binning: divide values into intervals
 Concept hierarchy climbing: organizes concepts (i.e., attribute
values) hierarchically
Data Transformation: Normalization
 Min-max normalization:
 v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
 Z-score normalization:
 v' = (v - mean_A) / stand_dev_A
 Normalization by decimal scaling:
 v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Example:
 Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
We would like to map income to the range [0.0, 1.0].
 Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and $16,000,
respectively.
 Suppose that the recorded values of A range from -986 to
917.
Normalization
 Min-max normalization:
 Ex. Let income range from $12,000 to $98,000, normalized to
[0.0, 1.0]. Then $73,600 is mapped to
 (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716
 Z-score normalization (亮: mean, : standard deviation):
 Ex. Let 亮 = 54,000,  = 16,000. Then,
 (73,600 - 54,000) / 16,000 = 1.225
 Decimal scaling: Suppose that the recorded values of A range from -986 to
917. To normalize by decimal scaling, we divide each value by 1,000
(i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
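
The same worked examples, reproduced in a short NumPy sketch:

```python
# The three normalization methods applied to the income example above.
import numpy as np

v = 73_600.0

# Min-max normalization to [0.0, 1.0]
min_a, max_a = 12_000.0, 98_000.0
print((v - min_a) / (max_a - min_a))  # 0.716...

# Z-score normalization
mean_a, std_a = 54_000.0, 16_000.0
print((v - mean_a) / std_a)           # 1.225

# Decimal scaling for values in [-986, 917]: j = 3, so divide by 10**3
values = np.array([-986.0, 917.0])
print(values / 10**3)                 # [-0.986  0.917]
```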
Discretization and Concept Hierarchy
 Discretization
 reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
 Interval labels can then be used to replace actual data values
 Example:
 Binning methods: equal-width, equal-frequency
Binning
 Attribute values (for one attribute, e.g., age):
 0, 4, 12, 16, 16, 18, 24, 26, 28
 Equi-width binning, for a bin width of e.g. 10:
 Bin 1: 0, 4 [-, 10) bin
 Bin 2: 12, 16, 16, 18 [10, 20) bin
 Bin 3: 24, 26, 28 [20, +) bin
 - denotes negative infinity, + positive infinity
 Equi-frequency binning, for a bin density of e.g.
3:
 Bin 1: 0, 4, 12 [-, 14) bin
 Bin 2: 16, 16, 18 [14, 21) bin
 Bin 3: 24, 26, 28 [21, +) bin
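
Both schemes are easy to reproduce with pandas; the explicit bin edges
below match the width-10 example, and q=3 yields roughly equal-frequency
bins.

```python
# Equal-width and equal-frequency binning of the age values above.
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width bins with edges [-inf, 10), [10, 20), [20, +inf)
print(pd.cut(ages, bins=[-float("inf"), 10, 20, float("inf")], right=False))

# Equal-frequency bins: three bins with about three values each
print(pd.qcut(ages, q=3))
```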
Concept Hierarchy Generation
Concept hierarchy:
organizes concepts (i.e., attribute values)
hierarchically.
Concept hierarchy formation:
 Recursively reduce the data by collecting
and replacing low-level concepts (such as
numeric values for age) with higher-level
concepts (such as child, youth, adult, or
senior), as in the sketch below.
Concept hierarchies can be explicitly
specified by domain experts, e.g.,
Kebele < Sub-city < City < Region or State < Country.
 They can also be automatically formed by
analyzing the number of distinct
values, e.g., for a set of attributes:
{Kebele, city, state, country}
 For numeric data, use discretization
methods.
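
A small sketch of climbing one level of an age hierarchy; the age
boundaries and labels are illustrative assumptions.

```python
# Replace numeric ages with higher-level concepts via labeled bins.
import pandas as pd

ages = pd.Series([3, 15, 29, 47, 71])
concepts = pd.cut(ages, bins=[0, 12, 25, 60, 120],
                  labels=["child", "youth", "adult", "senior"])
print(concepts.tolist())  # ['child', 'youth', 'adult', 'adult', 'senior']
```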
3. Dataset
 A dataset is a collection of data
objects and their attributes
 An attribute is a property or
characteristic of an object
 Examples: eye color of a person,
temperature, etc.
 Attribute is also known as variable,
field, characteristic, dimension, or
feature
 A collection of attributes
describe an object
 Object is also known as record,
point, case, sample, entity, or
instance
Tid | Refund | Marital Status | Taxable Income | Cheat
----|--------|----------------|----------------|------
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

(Columns are attributes; rows are objects.)
Types of Attributes
 The type of an attribute is determined by the set of possible values
the attribute can have: nominal, binary, ordinal, or numeric.
 There are different types of attributes:
Nominal: relating to names.
 The values of a nominal attribute are symbols or names of
things.
 Nominal attributes are also referred to as categorical.
 Examples: hair color (black, brown, blond, etc.), marital
status (single, married, divorced, widowed), occupation,
etc.
Ordinal:
 an attribute with possible values that have a meaningful order
or ranking among them.
 Examples: rankings, e.g., grades; height {tall, medium, short}
Types of Attributes..
Binary:
 a nominal attribute with only two categories or
states: 0 (absent) or 1 (present); Boolean (true or false).
 Example: smoker (0 = non-smoker, 1 = smoker)
Interval-scaled numeric attributes:
 are measured on a scale of equal-size units.
 allow us to compare and quantify the difference between values.
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio-scaled numeric attributes:
 a value can be treated as a multiple (or ratio) of another value.
 Examples: temperature in Kelvin, length, time, counts
Datasets preparation for learning
A standard machine learning technique is to divide the dataset into a
training set and a test set.
 The training dataset is used for model development.
 The test dataset is never seen during the model development stage and is
used to evaluate the accuracy of the model.
 There are various ways to separate the data into training
and test sets:
 The holdout method
 Cross-validation
 The bootstrap
The holdout method
 In this method, the given data is randomly partitioned
into two independent sets, a training set and a test set
(see the sketch below).
 Usually: one third for testing, the rest for training.
 For small or unbalanced datasets, samples might not
be representative:
 few or no instances of some classes.
 Stratified sampling: an advanced version that balances the
data:
 make sure that each class is represented with approximately
equal proportions in both subsets.
 Random subsampling: a variation of the holdout method in
which the holdout method is repeated k times.
 The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
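
A minimal sketch of the stratified holdout split with scikit-learn; the
synthetic data and class weights are arbitrary, while the one-third test
fraction follows the slide's convention.

```python
# Stratified holdout: one third for testing, class proportions preserved.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print(len(X_train), len(X_test))  # 200 100
```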
Cross-validation
 Cross-validation works as follows:
 First step: the data is randomly split into k equal-sized
subsets.
 A partition of a set is a collection of subsets for which the
intersection of any pair of sets is empty. That is, no element of
one subset is an element of another subset in a partition.
 Second step: each subset in turn is used for testing and the
remainder for training
This is called k-fold cross-validation
 Often the subsets are stratified before the cross-validation is
performed
 The error estimates are averaged to yield an overall error
estimate (see the sketch below).
Cross-validation example:
 Break up data into groups of the same size
 Hold aside one group for testing and use the rest to build model
 Repeat
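
A hedged k-fold sketch with scikit-learn; k = 5 and the decision tree
are arbitrary choices standing in for any model.

```python
# Stratified k-fold cross-validation with the fold scores averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())  # overall accuracy estimate across the 5 folds
```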
Bootstrap
 The bootstrap method samples the given training tuples uniformly
with replacement:
 the same tuple may be selected more than once.
 A commonly used variant is the .632 bootstrap (sketched below):
 Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a bootstrap sample
or training set of d samples.
 The data tuples that did not make it into the training set end up
forming the test set.
 on average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set
(hence, the name, .632 bootstrap)
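
A small sketch of one bootstrap round; d is arbitrary, and with d large
about 63.2% of the tuples are expected to land in the sample
(since 1 - 1/e ≈ 0.632).

```python
# Sample d tuples with replacement; the untouched tuples form the test set.
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
indices = np.arange(d)

sample = rng.choice(indices, size=d, replace=True)  # bootstrap training set
in_sample = np.unique(sample)
test = np.setdiff1d(indices, in_sample)             # held-out tuples

print(len(in_sample) / d, len(test) / d)  # roughly 0.632 and 0.368
```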
Assignment
 Explain PCA (Principal Component Analysis):
 How it works
 Advantages and disadvantages