Machine Learning
Chapter Two:
Data Preprocessing
1. Overview of data preprocessing
 Machine learning requires collecting a large amount of data to
achieve the intended objective.
 Real-world data often comes in an unusable format that cannot
be fed directly to machine learning models.
 Before feeding data to an ML model, we have to ensure its quality.
 Data preprocessing is the process of preparing raw data and
making it suitable for a machine learning model.
 It is a crucial step in creating a machine learning model.
 It increases the accuracy and efficiency of a machine learning
model.
Data Quality
 Well-accepted multidimensional data quality
measures include the following:
 Accuracy (free from errors and outliers)
 Completeness (no missing attributes and values)
 Consistency (no inconsistent values and attributes)
 Timeliness (appropriateness of the data for the purpose it is
required)
 Believability (acceptability)
 Interpretability (easy to understand)
Why Data Preprocessing?
 Most real-world data is of poor quality
(incomplete, inconsistent, noisy, invalid, redundant, ...)
 incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
 e.g., occupation = "" (left blank)
 noisy: containing errors or outliers
 e.g., Salary = -10
 inconsistent: containing discrepancies in codes or names
 e.g., Age=42 vs. Birthday=03/07/1997
 e.g., rating was 1, 2, 3; now it is A, B, C
 redundant: including everything, some of which is
irrelevant to our task.
No quality data, no quality results!
Data is often of low quality
 Collecting the required data is challenging
 Why?
 You didn't collect it yourself.
 It probably was created for some other use, and then you came
along wanting to integrate it.
 People make mistakes (typos)
 Data collection instruments used may be faulty.
 Everyone had their own way of structuring and formatting data,
based on what was convenient for them.
 Users may purposely submit incorrect data values for
mandatory fields when they do not wish to submit personal
information.
2. Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
 Data integration
 Integration of data from multiple data sources
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Data discretization (for numerical data) and Concept hierarchy generation
(Figure: forms of data preprocessing)
2.1. Data Cleaning
 Data cleaning attempts to:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Incomplete (Missing) Data:
 Data is not always available
 many tuples have no recorded value for several attributes,
such as customer income in sales data.
 Missing data may be due to
 equipment malfunction
 data being inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not being considered important at the time of
entry
 history or changes of the data not being registered.
How to Handle Missing Value?
 Ignore the tuple:
 usually done when the class label is missing (when doing
classification).
 Not effective unless the tuple is missing values for several attributes.
 Fill in the missing value manually: tedious and often infeasible.
 Fill it in automatically with:
 a global constant, e.g., "unknown" (effectively a new class!)
 a measure of central tendency for the attribute (e.g., the
mean or median)
 e.g., if the average customer income is $28,000, use this
value as the replacement.
 the most probable value:
 determined with regression, inference-based methods such as a
Bayesian formula, or a decision tree (the most popular approach).
How to Handle Missing Data?

Age | Income | Religion  | Gender
----|--------|-----------|-------
23  | 24,200 | Muslim    | M
39  | ?      | Christian | F
45  | 45,390 | ?         | F

Fill missing values using aggregate functions (e.g., the average) or
probabilistic estimates over the global value distribution, as in the
sketch below:
E.g., put the average income here, or the most probable income given
that the person is 39 years old.
E.g., put the most frequent religion here.
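
A minimal sketch of these fill-in strategies using pandas; the column
names and values mirror the table above, and the choice of mean for the
numeric attribute and mode for the categorical one is one reasonable
option among those listed, not the only one.

```python
# Minimal missing-value imputation sketch (values from the example table).
import pandas as pd

df = pd.DataFrame({
    "Age": [23, 39, 45],
    "Income": [24200, None, 45390],
    "Religion": ["Muslim", "Christian", None],
    "Gender": ["M", "F", "F"],
})

# Numeric attribute: fill with a measure of central tendency (the mean).
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Categorical attribute: fill with the most frequent value (the mode).
df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

print(df)
```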
Noisy Data
 Noise is a random error or variance in a measured
variable.
 Incorrect attribute values may be due to
 faulty data collection instruments (e.g., OCR)
 data entry problems, e.g., "green" typed as "rgeen"
 data transmission problems
 technology limitations
 inconsistency in naming conventions
How to Handle Noisy Data?
Manually check all data: tedious and often infeasible.
Sort data by frequency:
"green" is more frequent than "rgeen"
Works well for categorical data.
 Use numerical constraints to catch corrupt data (see the sketch below):
 Weight can't be negative
 People can't have more than 2 parents
 Salary can't be less than Birr 300
Check for outliers (the case of the 8-meter man).
Check for correlated outliers using n-grams ("pregnant
male"):
People can be male
People can be pregnant
People can't be male AND pregnant
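
A small sketch of constraint-based checking; the DataFrame contents and
column names are illustrative assumptions, while the thresholds come from
the constraints above.

```python
# Flag records that violate simple numerical constraints.
import pandas as pd

df = pd.DataFrame({
    "weight_kg": [70, -5, 82],
    "salary_birr": [5000, 120, 9000],
})

# Weight can't be negative; salary can't be less than Birr 300.
corrupt = df[(df["weight_kg"] < 0) | (df["salary_birr"] < 300)]
print(corrupt)  # rows flagged for correction or removal
```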
2.2. Data Integration
 Data integration combines data from multiple sources
into a coherent store
 Because different sources are used, data that is
fine on its own may become problematic when we want
to integrate it.
 Some of the issues are:
Different formats and structures
Conflicting and redundant data
Data at different levels
Data Integration: Formats
 Not everyone uses the same format. Do you agree?
 Schema integration: e.g., A.cust-id ≡ B.cust-#
 Integrate metadata from different sources
 Dates are especially problematic:
 12/19/97
 19/12/97
 19/12/1997
 19-12-97
 Dec 19, 1997
 19 December 1997
 19th Dec. 1997
 Money suffers the same problem:
 Birr 200, Br. 200, 200 Birr, ...
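
The date problem lends itself to a small illustration; this sketch
normalizes several of the listed formats with Python's standard library.
The format list, and its order (which resolves the day/month ambiguity),
is an assumption about the sources being integrated.

```python
# Coerce heterogeneous date strings into one canonical date.
from datetime import datetime

raw_dates = ["12/19/97", "19/12/1997", "19-12-97", "Dec 19, 1997", "19 December 1997"]
formats = ["%m/%d/%y", "%d/%m/%Y", "%d-%m-%y", "%b %d, %Y", "%d %B %Y"]

def parse_date(s):
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None  # leave unparseable values for manual review

print([parse_date(d) for d in raw_dates])  # every entry becomes 1997-12-19
```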
Data Integration: Inconsistent
Inconsistent data contains discrepancies in codes or
names, often due to a lack of standardization or naming
conventions, e.g.,
Age=26 vs. Birthday=03/07/1986
Some use 1, 2, 3 for ratings; others use A, B, C
Data Integration: Conflicting Data
Detecting and resolving data value conflicts:
For the same real-world entity, attribute values from different
sources differ.
Possible reasons: different representations, different scales, e.g.,
American vs. British units
 weight: kilograms vs. pounds
 height: meters vs. inches
2.3.Data Reduction Strategies
Data reduction: obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results.
Data reduction strategies:
Dimensionality reduction
 Select the best attributes or remove unimportant attributes
Numerosity reduction
 Reduce data volume by choosing alternative, smaller forms of
data representation
Data compression
Data Reduction: Dimensionality Reduction
 Dimensionality reduction
Helps eliminate irrelevant attributes, which contain no
information useful for model development, and reduces noise.
E.g., is a student's ID relevant for predicting their GPA?
Helps avoid redundant attributes, which duplicate
information held in one or more other attributes.
E.g., the purchase price of a product & the amount of sales tax paid
Reduces the time and space required for model development.
Allows easier visualization.
 Method: attribute subset selection
One way to reduce the dimensionality of data is to select the
best attributes, as in the sketch below.
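
A hedged sketch of attribute subset selection with scikit-learn; the
synthetic data, k = 5, and the ANOVA F-score are placeholder choices, not
a prescription.

```python
# Keep only the k highest-scoring attributes.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 5): 20 attributes reduced to the best 5
```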
Data Reduction: Numerosity Reduction
 Different methods can be used, including Clustering and
sampling
 Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 There are many choices of clustering definitions and clustering
algorithms
 Sampling
 obtain a small sample s to represent the whole data set N
 Key principle: choose a representative subset of the data using a
suitable sampling technique, as sketched below
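
A minimal sampling sketch with pandas; the data and the 10% fraction are
arbitrary assumptions.

```python
# Numerosity reduction by simple random sampling without replacement.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=10_000)})

sample = df.sample(frac=0.10, random_state=0)
print(len(sample))  # 1,000 rows stand in for the full data set
```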
2.4. Data Transformation
 A function that maps the entire set of values of a given
attribute to a new set of replacement values, such that
each old value can be identified with one of the new
values.
 Methods for data transformation
 Normalization: Scaled to fall within a smaller, specified range of
values
 min-max normalization
 z-score normalization
 decimal scaling
 Discretization: Reduce data size by dividing the range of a
continuous attribute into intervals.
 Discretization can be performed recursively on an attribute
using methods such as
 Binning: divide values into intervals
 Concept hierarchy climbing: organizes concepts (i.e., attribute
values) hierarchically
Data Transformation: Normalization
 Min-max normalization:
 v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
 Z-score normalization:
 v' = (v - mean_A) / stand_dev_A
 Normalization by decimal scaling:
 v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Example:
 Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
We would like to map income to the range [0.0, 1.0].
 Suppose that the mean and standard deviation of the
values for the attribute income are $54,000 and $16,000,
respectively.
 Suppose that the recorded values of A range from -986 to
917.
Normalization
 Min-max normalization:
 Ex. Let income range from $12,000 to $98,000, normalized to
[0.0, 1.0]. Then $73,600 is mapped to
 (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716
 Z-score normalization (亮: mean, : standard deviation):
 Ex. Let 亮 = 54,000,  = 16,000. Then,
 (73,600 - 54,000) / 16,000 = 1.225
 Decimal scaling: Suppose that the recorded values of A range from -986 to
917. To normalize by decimal scaling, we divide each value by 1,000
(i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
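
The same worked examples, reproduced in a short NumPy sketch:

```python
# The three normalization methods applied to the income example above.
import numpy as np

v = 73_600.0

# Min-max normalization to [0.0, 1.0]
min_a, max_a = 12_000.0, 98_000.0
print((v - min_a) / (max_a - min_a))  # 0.716...

# Z-score normalization
mean_a, std_a = 54_000.0, 16_000.0
print((v - mean_a) / std_a)           # 1.225

# Decimal scaling for values in [-986, 917]: j = 3, so divide by 10**3
values = np.array([-986.0, 917.0])
print(values / 10**3)                 # [-0.986  0.917]
```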
Discretization and Concept Hierarchy
 Discretization
 reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
 Interval labels can then be used to replace actual data values
 Example:
 Binning methods: equal-width, equal-frequency
Binning
 Attribute values (for one attribute, e.g., age):
 0, 4, 12, 16, 16, 18, 24, 26, 28
 Equi-width binning, for a bin width of e.g. 10:
 Bin 1: 0, 4 [-, 10) bin
 Bin 2: 12, 16, 16, 18 [10, 20) bin
 Bin 3: 24, 26, 28 [20, +) bin
 - denotes negative infinity, + positive infinity
 Equi-frequency binning, for a bin density of e.g.
3:
 Bin 1: 0, 4, 12 [-, 14) bin
 Bin 2: 16, 16, 18 [14, 21) bin
 Bin 3: 24, 26, 28 [21, +) bin
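
Both schemes are easy to reproduce with pandas; the explicit bin edges
below match the width-10 example, and q=3 yields roughly equal-frequency
bins.

```python
# Equal-width and equal-frequency binning of the age values above.
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width bins with edges [-inf, 10), [10, 20), [20, +inf)
print(pd.cut(ages, bins=[-float("inf"), 10, 20, float("inf")], right=False))

# Equal-frequency bins: three bins with about three values each
print(pd.qcut(ages, q=3))
```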
Concept Hierarchy Generation
Concept hierarchy:
organizes concepts (i.e., attribute values)
hierarchically.
Concept hierarchy formation:
 Recursively reduce the data by collecting
and replacing low-level concepts (such as
numeric values for age) with higher-level
concepts (such as child, youth, adult, or
senior), as in the sketch below.
Concept hierarchies can be explicitly
specified by domain experts, e.g.,
Kebele < Sub-city < City < Region or State < Country.
 They can also be automatically formed by
analyzing the number of distinct
values, e.g., for a set of attributes:
{Kebele, city, state, country}
 For numeric data, use discretization
methods.
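
A small sketch of climbing one level of an age hierarchy; the age
boundaries and labels are illustrative assumptions.

```python
# Replace numeric ages with higher-level concepts via labeled bins.
import pandas as pd

ages = pd.Series([3, 15, 29, 47, 71])
concepts = pd.cut(ages, bins=[0, 12, 25, 60, 120],
                  labels=["child", "youth", "adult", "senior"])
print(concepts.tolist())  # ['child', 'youth', 'adult', 'adult', 'senior']
```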
3. Dataset
 A dataset is a collection of data
objects and their attributes
 An attribute is a property or
characteristic of an object
 Examples: eye color of a person,
temperature, etc.
 Attribute is also known as variable,
field, characteristic, dimension, or
feature
 A collection of attributes
describe an object
 Object is also known as record,
point, case, sample, entity, or
instance
Tid | Refund | Marital Status | Taxable Income | Cheat
----|--------|----------------|----------------|------
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

(Columns are attributes; rows are objects.)
Types of Attributes
 The type of an attribute is determined by the set of possible values
the attribute can have: nominal, binary, ordinal, or numeric.
 There are different types of attributes:
Nominal: relating to names.
 The values of a nominal attribute are symbols or names of
things.
 Nominal attributes are also referred to as categorical.
 Examples: hair color (black, brown, blond, etc.), marital
status (single, married, divorced, widowed), occupation,
etc.
Ordinal:
 an attribute with possible values that have a meaningful order
or ranking among them.
 Examples: rankings, e.g., grades; height {tall, medium, short}
Types of Attributes..
Binary:
 a nominal attribute with only two categories or
states: 0 (absent) or 1 (present); Boolean (true or false).
 Example: smoker (0 = non-smoker, 1 = smoker)
Interval-scaled numeric attributes:
 are measured on a scale of equal-size units.
 allow us to compare and quantify the difference between values.
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
Ratio-scaled numeric attributes:
 a value can be treated as a multiple (or ratio) of another value.
 Examples: temperature in Kelvin, length, time, counts
Datasets preparation for learning
A standard machine learning technique is to divide the dataset into a
training set and a test set.
 The training dataset is used for model development.
 The test dataset is never seen during the model development stage and is
used to evaluate the accuracy of the model.
 There are various ways to separate the data into training
and test sets:
 The holdout method
 Cross-validation
 The bootstrap
The holdout method
 In this method, the given data is randomly partitioned
into two independent sets, a training set and a test set
(see the sketch below).
 Usually: one third for testing, the rest for training.
 For small or unbalanced datasets, samples might not
be representative:
 few or no instances of some classes.
 Stratified sampling: an advanced version that balances the
data:
 make sure that each class is represented with approximately
equal proportions in both subsets.
 Random subsampling: a variation of the holdout method in
which the holdout method is repeated k times.
 The overall accuracy estimate is taken as the average of the
accuracies obtained from each iteration.
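
A minimal sketch of the stratified holdout split with scikit-learn; the
synthetic data and class weights are arbitrary, while the one-third test
fraction follows the slide's convention.

```python
# Stratified holdout: one third for testing, class proportions preserved.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
print(len(X_train), len(X_test))  # 200 100
```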
Cross-validation
 Cross-validation works as follows:
 First step: the data is randomly split into k equal-sized
subsets.
 A partition of a set is a collection of subsets for which the
intersection of any pair of sets is empty. That is, no element of
one subset is an element of another subset in a partition.
 Second step: each subset in turn is used for testing and the
remainder for training
This is called k-fold cross-validation
 Often the subsets are stratified before the cross-validation is
performed
 The error estimates are averaged to yield an overall error
estimate (see the sketch below).
Cross-validation example:
 Break up data into groups of the same size
 Hold aside one group for testing and use the rest to build model
 Repeat
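
A hedged k-fold sketch with scikit-learn; k = 5 and the decision tree
are arbitrary choices standing in for any model.

```python
# Stratified k-fold cross-validation with the fold scores averaged.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())  # overall accuracy estimate across the 5 folds
```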
Bootstrap
 The bootstrap method samples the given training tuples uniformly
with replacement:
 the same tuple may be selected more than once.
 A commonly used variant is the .632 bootstrap (sketched below):
 Suppose we are given a data set of d tuples. The data set is
sampled d times, with replacement, resulting in a bootstrap sample
or training set of d samples.
 The data tuples that did not make it into the training set end up
forming the test set.
 on average, 63.2% of the original data tuples will end up in the
bootstrap sample, and the remaining 36.8% will form the test set
(hence, the name, .632 bootstrap)
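
A small sketch of one bootstrap round; d is arbitrary, and with d large
about 63.2% of the tuples are expected to land in the sample
(since 1 - 1/e ≈ 0.632).

```python
# Sample d tuples with replacement; the untouched tuples form the test set.
import numpy as np

rng = np.random.default_rng(0)
d = 10_000
indices = np.arange(d)

sample = rng.choice(indices, size=d, replace=True)  # bootstrap training set
in_sample = np.unique(sample)
test = np.setdiff1d(indices, in_sample)             # held-out tuples

print(len(in_sample) / d, len(test) / d)  # roughly 0.632 and 0.368
```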
Assignment
 Explain PCA (Principal Component Analysis):
 How it works
 Advantages and disadvantages