This document discusses data preprocessing for machine learning. It covers the importance of data preprocessing to clean and prepare raw data before building machine learning models. Specifically, it discusses tasks like data cleaning to handle missing values, noisy data and outliers. It also covers data integration, reduction and transformation techniques such as normalization, discretization and concept hierarchy generation. The goal of these techniques is to improve data quality and make it suitable for machine learning algorithms.
1. Overview of data preprocessing

Machine learning requires collecting a great amount of data to achieve the intended objective.
Real-world data generally arrives in an unusable format that cannot be fed directly to machine learning models.
Before feeding data to an ML model, we have to make sure of its quality.
Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
It is a crucial step when creating a machine learning model.
It increases the accuracy and efficiency of a machine learning model.
Data Quality

Well-accepted multidimensional data quality measures include:
Accuracy (free from errors and outliers)
Completeness (no missing attributes and values)
Consistency (no inconsistent values and attributes)
Timeliness (appropriateness of the data for the purpose it is required)
Believability (acceptability)
Interpretability (easy to understand)
Why Data Preprocessing?

Most real-world data is of poor quality (incomplete, inconsistent, noisy, invalid, redundant, ...).
Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., Occupation="" (a missing value)
Noisy: containing errors or outliers
e.g., Salary="-10"
Inconsistent: containing discrepancies in codes or names
e.g., Age="42" but Birthday="03/07/1997"
e.g., ratings were recorded as 1, 2, 3 and are now A, B, C
Redundant: including everything, some of which is irrelevant to the task at hand.
No quality data, no quality results!
Data is often of low quality

Collecting the required data is challenging. Why?
You didn't collect it yourself.
It was probably created for some other use, and then you came along wanting to integrate it.
People make mistakes (typos).
Data collection instruments used may be faulty.
Everyone had their own way of structuring and formatting data, based on what was convenient for them.
Users may purposely submit incorrect values for mandatory fields when they do not wish to submit personal information.
2. Major Tasks in Data Preprocessing

Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of data from multiple data sources
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Data discretization (for numerical data) and concept hierarchy generation
2.1. Data Cleaning

Data cleaning attempts to:
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Incomplete (Missing) Data

Data is not always available:
many tuples have no recorded value for several attributes, such as customer income in sales data.
Missing data may be due to:
equipment malfunction
values inconsistent with other recorded data that were therefore deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
history or changes of the data not being registered
How to Handle Missing Values?

Ignore the tuple:
usually done when the class label is missing (when doing classification).
Not effective unless the tuple has several attributes with missing values.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant, e.g., "unknown" (a new class?!)
a measure of central tendency for the attribute (e.g., the mean or median)
e.g., if the average income of customers is $28,000, use this value as the replacement.
the most probable value:
determined with regression, an inference-based method such as the Bayesian formula, or a decision tree. (most popular)
How to Handle Missing Data?

Age | Income | Religion  | Gender
23  | 24,200 | Muslim    | M
39  | ?      | Christian | F
45  | 45,390 | ?         | F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates based on the global value distribution:
e.g., put the average income here, or put the most probable income given that the person is 39 years old
e.g., put the most frequent religion here
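A minimal sketch of this kind of imputation in Python with pandas (the column names mirror the toy table above; pandas itself is an assumption, since the slides name no library):

import pandas as pd

# Toy table from the slide; None marks the '?' entries.
df = pd.DataFrame({
    "Age": [23, 39, 45],
    "Income": [24200, None, 45390],
    "Religion": ["Muslim", "Christian", None],
    "Gender": ["M", "F", "F"],
})

# Numeric attribute: fill with a measure of central tendency (here the mean).
df["Income"] = df["Income"].fillna(df["Income"].mean())

# Categorical attribute: fill with the most frequent value (the mode).
df["Religion"] = df["Religion"].fillna(df["Religion"].mode()[0])

print(df)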
Noisy Data

Noise is a random error or variance in a measured variable.
Incorrect attribute values may be due to:
faulty data collection instruments (e.g., OCR)
data entry problems, e.g., "green" written as "rgeen"
data transmission problems
technology limitations
inconsistency in naming conventions
How to Handle Noisy Data?

Manually check all data: tedious + infeasible?
Sort data by frequency:
"green" is more frequent than "rgeen"
works well for categorical data
Use, say, numerical constraints to catch corrupt data:
weight can't be negative
people can't have more than 2 parents
salary can't be less than Birr 300
Check for outliers (the case of the 8-meter man)
Check for correlated outliers, e.g., using n-grams ("pregnant male"):
people can be male
people can be pregnant
people can't be male AND pregnant
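A small sketch of rule-based checks like those above, assuming a pandas DataFrame with hypothetical weight/salary/gender/pregnant columns:

import pandas as pd

df = pd.DataFrame({
    "weight": [72.5, -4.0, 88.1],
    "salary": [5000, 150, 12000],
    "gender": ["M", "F", "M"],
    "pregnant": [True, False, False],
})

# Numerical constraints: weight can't be negative, salary can't be below Birr 300.
bad_weight = df["weight"] < 0
bad_salary = df["salary"] < 300

# Correlated-outlier rule: a record can't be male AND pregnant.
male_pregnant = (df["gender"] == "M") & df["pregnant"]

print(df[bad_weight | bad_salary | male_pregnant])  # rows flagged for cleaning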
2.2. Data Integration

Data integration combines data from multiple sources into a coherent store.
Because different sources are used, data that is fine on its own may become problematic when we want to integrate it.
Some of the issues are:
different formats and structures
conflicting and redundant data
data at different levels
Data Integration: Formats

Not everyone uses the same format. Do you agree?
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources.
Dates are especially problematic:
12/19/97
19/12/97
19/12/1997
19-12-97
Dec 19, 1997
19 December 1997
19th Dec. 1997
Are you frequently writing money as: Birr 200, Br. 200, 200 Birr, ...?
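A sketch of normalizing several of the date variants above to one ISO format; it assumes Python's python-dateutil package, which the slides do not prescribe:

from dateutil import parser

variants = ["12/19/97", "19/12/1997", "19-12-97",
            "Dec 19, 1997", "19 December 1997"]

# dateutil guesses each layout; dayfirst=True resolves ambiguous forms like 19-12-97.
for raw in variants:
    dt = parser.parse(raw, dayfirst=True)
    print(raw, "->", dt.date().isoformat())  # every variant becomes 1997-12-19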
Data Integration: Inconsistent Data

Inconsistent data: containing discrepancies in codes or names, which is also a problem of lacking standardization / naming conventions, e.g.,
Age="26" vs. Birthday="03/07/1986"
some use 1, 2, 3 for rating; others A, B, C

Data Integration: Conflicting Data

Detecting and resolving data value conflicts:
for the same real-world entity, attribute values from different sources differ.
Possible reasons: different representations, different scales, e.g., American vs. British units
weight measurement: kg or pound
height measurement: meter or inch
2.3. Data Reduction Strategies

Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
Data reduction strategies:
Dimensionality reduction
Select the best attributes or remove unimportant attributes
Numerosity reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Data compression
Data Reduction: Dimensionality Reduction

Dimensionality reduction:
helps to eliminate irrelevant attributes and reduce noise, i.e., attributes that contain no information useful for model development
e.g., is a student's ID relevant for predicting the student's GPA?
helps to avoid redundant attributes, which duplicate information contained in one or more other attributes
e.g., the purchase price of a product and the amount of sales tax paid
reduces the time and space required for model development
allows easier visualization
Method: attribute subset selection
One way to reduce the dimensionality of data is by selecting the best attributes, as sketched below.
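A minimal attribute-subset-selection sketch using scikit-learn's SelectKBest (scikit-learn is an assumed choice; the slides name no tool):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)          # 4 attributes
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)      # keep the 2 most informative attributes

print("kept attribute indices:", selector.get_support(indices=True))
print("reduced shape:", X_best.shape)      # (150, 2)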
Data Reduction: Numerosity Reduction

Different methods can be used, including clustering and sampling.
Clustering:
partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
there are many choices of clustering definitions and clustering algorithms
Sampling:
obtain a small sample s to represent the whole data set N
key principle: choose a representative subset of the data using a suitable sampling technique (see the sketch below)
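A sketch of both ideas under assumed tooling (NumPy and scikit-learn): store k-means centroids in place of the raw points, and draw a small random sample to represent the full set:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 3))        # the full data set N

# Clustering: keep only 50 centroids as the reduced representation.
centroids = KMeans(n_clusters=50, n_init=10, random_state=0).fit(data).cluster_centers_

# Sampling: a simple random sample s drawn without replacement.
sample = data[rng.choice(len(data), size=500, replace=False)]

print(centroids.shape, sample.shape)       # (50, 3) (500, 3)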
2.4. Data Transformation

A data transformation is a function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.
Methods for data transformation:
Normalization: scale values to fall within a smaller, specified range
min-max normalization
z-score normalization
decimal scaling
Discretization: reduce data size by dividing the range of a continuous attribute into intervals.
Discretization can be performed recursively on an attribute using methods such as:
Binning: divide values into intervals
Concept hierarchy climbing: organize concepts (i.e., attribute values) hierarchically
Data Transformation: Normalization

Min-max normalization:
v' = \frac{v - \min_A}{\max_A - \min_A}(new\_max_A - new\_min_A) + new\_min_A

Z-score normalization:
v' = \frac{v - mean_A}{stand\_dev_A}

Normalization by decimal scaling:
v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
Example:

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0].
Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively.
Suppose that the recorded values of A range from -986 to 917.
Normalization

Min-max normalization:
e.g., income ranging from $12,000 to $98,000 is normalized to [0.0, 1.0]. Then $73,600 is mapped to
\frac{73600 - 12000}{98000 - 12000}(1.0 - 0) + 0 = 0.716

Z-score normalization (\mu: mean, \sigma: standard deviation):
e.g., let \mu = 54{,}000 and \sigma = 16{,}000. Then
\frac{73600 - 54000}{16000} = 1.225

Decimal scaling: suppose that the recorded values of A range from -986 to 917. To normalize by decimal scaling, we divide each value by 1,000 (i.e., j = 3), so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
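The same three transforms as a quick Python sketch, reproducing the numbers above (plain NumPy is an assumed choice):

import numpy as np

v = 73_600.0

# Min-max normalization of income from [12,000, 98,000] to [0.0, 1.0].
minmax = (v - 12_000) / (98_000 - 12_000) * (1.0 - 0.0) + 0.0   # ≈ 0.716

# Z-score normalization with mean 54,000 and standard deviation 16,000.
zscore = (v - 54_000) / 16_000                                   # 1.225

# Decimal scaling for values in [-986, 917]: j = 3, so divide by 10**3.
scaled = np.array([-986.0, 917.0]) / 10**3                       # [-0.986, 0.917]

print(round(minmax, 3), zscore, scaled)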
Discretization and Concept Hierarchy

Discretization:
reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
Interval labels can then be used to replace actual data values.
Example: binning methods (equal-width, equal-frequency)
Binning

Attribute values (for one attribute, e.g., age): 0, 4, 12, 16, 16, 18, 24, 26, 28

Equal-width binning with a bin width of, e.g., 10:
Bin 1: 0, 4            [-∞, 10) bin
Bin 2: 12, 16, 16, 18  [10, 20) bin
Bin 3: 24, 26, 28      [20, +∞) bin
(-∞ denotes negative infinity, +∞ positive infinity)

Equal-frequency binning with a bin density of, e.g., 3:
Bin 1: 0, 4, 12        [-∞, 14) bin
Bin 2: 16, 16, 18      [14, 21) bin
Bin 3: 24, 26, 28      [21, +∞] bin
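A sketch of both schemes with pandas cut/qcut (an assumed tool); qcut's computed edges differ slightly from the slide's hand-picked cut points, but the resulting groupings match:

import numpy as np
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width binning using the slide's cut points 10 and 20.
equal_width = pd.cut(ages, bins=[-np.inf, 10, 20, np.inf], right=False)

# Equal-frequency binning into 3 bins of 3 values each.
equal_freq = pd.qcut(ages, q=3)

print(pd.DataFrame({"age": ages, "width": equal_width, "freq": equal_freq}))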
Concept Hierarchy Generation

Concept hierarchy: organizes concepts (i.e., attribute values) hierarchically.
Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as child, youth, adult, or senior).
Concept hierarchies can be explicitly specified by domain experts, e.g.:
country
region or state
city
sub-city
kebele
They can also be formed automatically by analyzing the number of distinct values, e.g., for a set of attributes {Kebele, city, state, country}.
For numeric data, use discretization methods, as in the sketch below.
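A minimal sketch of climbing one level of a numeric concept hierarchy: replacing raw ages with the higher-level labels named above (the cut points are illustrative assumptions, not from the slides):

import numpy as np
import pandas as pd

ages = pd.Series([3, 15, 34, 70])

# Replace numeric ages with higher-level concepts; the boundaries are assumed.
labels = pd.cut(ages,
                bins=[0, 12, 25, 60, np.inf],
                labels=["child", "youth", "adult", "senior"])

print(list(labels))   # ['child', 'youth', 'adult', 'senior']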
3. Dataset

A dataset is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
Examples: eye color of a person, temperature, etc.
An attribute is also known as a variable, field, characteristic, dimension, or feature.
A collection of attributes describes an object.
An object is also known as a record, point, case, sample, entity, or instance.

Example (objects are rows, attributes are columns):

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes
Types of Attributes

The type of an attribute is determined by the set of possible values the attribute can have: nominal, binary, ordinal, or numeric.
Nominal: relating to names.
The values of a nominal attribute are symbols or names of things.
Nominal attributes are also referred to as categorical.
Examples: hair color (black, brown, blond, etc.), marital status (single, married, divorced, widowed), occupation, etc.
Ordinal: an attribute whose possible values have a meaningful order or ranking among them.
Examples: rankings (e.g., grades), height {tall, medium, short}
Types of Attributes (continued)

Binary: a nominal attribute with only two categories or states: 0 (absent) or 1 (present), or Boolean (true or false).
Example: smoker (0 = non-smoker, 1 = smoker)
Interval-scaled (numeric) attributes:
measured on a scale of equal-size units
allow us to compare and quantify the difference between values
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio-scaled (numeric) attributes:
a value can be expressed as a multiple (or ratio) of another value
Examples: length, time, counts
Datasets preparation for learning

A standard machine learning technique is to divide the dataset into a training set and a test set.
The training dataset is used for model development.
The test dataset is never seen during the model development stage and is used to evaluate the accuracy of the model.
There are various ways to separate the data into training and test sets:
the holdout method
cross-validation
the bootstrap
The holdout method

In this method, the given data are randomly partitioned into two independent sets, a training set and a test set.
Usually: one third for testing, the rest for training.
For small or unbalanced datasets, the samples might not be representative:
few or no instances of some classes.
Stratified sampling: an advanced version that balances the data
makes sure that each class is represented with approximately equal proportions in both subsets.
Random subsampling: a variation of the holdout method in which the holdout is repeated k times;
the overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.
A stratified holdout split is sketched below.
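A minimal stratified holdout sketch with scikit-learn (an assumed library choice): one third held out for testing, with class proportions preserved:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out one third for testing; stratify=y keeps class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

print(len(X_train), len(X_test))   # 100 50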
Cross-validation

Cross-validation works as follows:
First step: the data is randomly split into k subsets (folds) of equal size.
A partition of a set is a collection of subsets for which the intersection of any pair of sets is empty; that is, no element of one subset is an element of another subset in the partition.
Second step: each subset in turn is used for testing and the remainder for training.
This is called k-fold cross-validation.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
Cross-validation example:
Break up the data into groups of the same size, hold aside one group for testing, use the rest to build the model, and repeat with each group serving once as the test set. A code sketch follows.
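A k-fold sketch with scikit-learn's cross_val_score (assumed tooling); for classifiers it stratifies the folds by default:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores, "mean accuracy:", scores.mean())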
Bootstrap

The bootstrap method samples the given training tuples uniformly with replacement:
the machine is allowed to select the same tuple more than once.
A commonly used variant is the .632 bootstrap:
suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample (training set) of d samples.
The data tuples that did not make it into the training set form the test set.
On average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).
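A sketch of one bootstrap round in NumPy (an assumed choice), showing empirically that roughly 63.2% of tuples land in the training sample:

import numpy as np

rng = np.random.default_rng(0)
d = 10_000
indices = np.arange(d)

# Sample d times with replacement to form the bootstrap training set.
train_idx = rng.choice(indices, size=d, replace=True)

# Tuples never drawn form the test set: about 36.8% of the data on average.
test_idx = np.setdiff1d(indices, train_idx)

print(len(np.unique(train_idx)) / d)   # ≈ 0.632
print(len(test_idx) / d)               # ≈ 0.368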