Fall 2024-25
Precision Agriculture
Dr. C. Moganapriya
moganapriya.c@vit.ac.in
Data Preprocessing
Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Quality: Why Preprocess the Data?
• Measures for data quality: a multidimensional view
  • Accuracy: correct or wrong, accurate or not
  • Completeness: not recorded, unavailable, ...
  • Consistency: some modified but some not, dangling, ...
  • Timeliness: is the data updated in time?
  • Believability: how far can the data be trusted to be correct?
  • Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., Occupation = "" (missing data)
• Noisy: containing noise, errors, or outliers
  • e.g., Salary = "-10" (an error)
• Inconsistent: containing discrepancies in codes or names, e.g.,
  • Age = "42", Birthday = "03/07/2010"
  • was rating "1, 2, 3", now rating "A, B, C"
  • discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
  • Jan. 1 as everyone's birthday?
How to Handle Missing Data?
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with (see the sketch below)
  • a global constant: e.g., "unknown", a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
• Manual: practical only for small data sets
• Automatic: more efficient for larger data sets
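A minimal pandas sketch of the automatic strategies above (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical crop records; yield_t_ha has missing values
df = pd.DataFrame({
    "field": ["A", "A", "B", "B", "B"],
    "yield_t_ha": [3.2, None, 5.1, None, 4.9],
})

# Global constant: mark missing values with a placeholder
df["filled_const"] = df["yield_t_ha"].fillna(-1.0)

# Attribute mean: fill with the overall column mean
df["filled_mean"] = df["yield_t_ha"].fillna(df["yield_t_ha"].mean())

# Class-wise mean: fill with the mean of samples from the same field
df["filled_class_mean"] = df["yield_t_ha"].fillna(
    df.groupby("field")["yield_t_ha"].transform("mean")
)
print(df)
```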
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems that require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?
• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • smooth by fitting the data to regression functions
• Clustering
  • similar items are grouped into clusters; values that fall outside every cluster are detected as outliers and removed (see the sketch below)
• Combined computer and human inspection
  • detect suspicious values and have a human check them (e.g., deal with possible outliers)
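A minimal sketch of clustering-based outlier detection (the readings are invented; DBSCAN is one clustering method that labels points belonging to no dense group as noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical soil-moisture readings; 80.0 fits neither group
readings = np.array([22.1, 21.8, 22.4, 35.2, 34.9, 35.5, 80.0]).reshape(-1, 1)

# Points within eps of at least min_samples neighbours form clusters;
# anything left over is labelled -1 (noise/outlier)
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(readings)
print(readings[labels == -1].ravel())  # -> [80.]
```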
Data Integration
• Data integration:
  • combines data from multiple heterogeneous sources into a coherent store
• 2 types
  • Tight coupling
    • data is physically combined into a single location (see the sketch below)
  • Loose coupling
    • only an interface is created; the data is combined and accessed through that interface
    • the data remains in the actual source databases only
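A tight-coupling-style sketch in pandas: records from two hypothetical sources are physically merged into one table (all names and values are invented):

```python
import pandas as pd

# Two heterogeneous sources describing the same fields
soil = pd.DataFrame({"field_id": [1, 2], "ph": [6.4, 7.1]})
weather = pd.DataFrame({"FieldID": [1, 2], "rain_mm": [12.0, 3.5]})

# Entity identification: FieldID and field_id refer to the same key
weather = weather.rename(columns={"FieldID": "field_id"})

# Tight coupling: physically combine both sources into one coherent table
combined = soil.merge(weather, on="field_id")
print(combined)
```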
Data Reduction
• The volume of data is reduced to make analysis easier
Data Reduction
• Dimensionality reduction
  • reduces the number of input variables in the data set, since too many input variables can result in poor performance (see the sketch below)
• Data cube aggregation
  • data is combined to construct a data cube
• Attribute subset selection
  • only highly relevant attributes are kept; the other attributes are discarded, so the data is reduced
• Numerosity reduction
  • only a model of the data is stored instead of the entire data
  • Parametric (e.g., regression models)
  • Non-parametric: histograms, clusters, sampling
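A minimal dimensionality-reduction sketch with PCA on synthetic data: five correlated attributes are projected onto two components with little loss of variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 5 attributes that are essentially 2-dimensional
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

# Project the 5 attributes down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this data
```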
Data Reduction
[Figure: example of data cube aggregation]
Data Compression
• Lossless: the original data can be reconstructed exactly from the compressed data
• Lossy: only an approximation of the original data can be reconstructed
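A small sketch of the lossless/lossy distinction (zlib stands in for a lossless codec; float64-to-float16 casting stands in for a lossy one):

```python
import zlib
import numpy as np

# Hypothetical measurements between 0 and 1
arr = np.linspace(0.0, 1.0, 1000)
data = arr.tobytes()

# Lossless: the compressed bytes round-trip to the original exactly
packed = zlib.compress(data)
assert zlib.decompress(packed) == data

# Lossy: float64 -> float16 cuts storage 4x but only approximates the values
approx = arr.astype(np.float16).astype(np.float64)
print(len(data), len(packed))      # original vs. compressed size
print(np.abs(arr - approx).max())  # small but non-zero reconstruction error
```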
Data Transformation
• Data is transformed into a form appropriate for the mining process
• There are 4 methods of data transformation:
  1. Normalization
  2. Attribute selection
  3. Discretization
  4. Concept hierarchy generation
1. Normalization
• Normalization scales the data values into a specified range
• For example, -1.0 to +1.0, or 0 to 1
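A minimal min-max normalization sketch (one common scaling method; the values are invented):

```python
import numpy as np

# Hypothetical values to rescale into the range [0, 1]
x = np.array([4.0, 8.0, 15.0, 21.0, 34.0])

# Min-max normalization: v' = (v - min) / (max - min)
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # 4 -> 0.0, 34 -> 1.0
```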
Data Transformation
2. Attribute selection
• New attributes are constructed from the given ones
3. Discretization
• Raw values are replaced by interval values
4. Concept hierarchy generation
• Attributes are converted from a low level to a higher level
• Example: city to country
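A short sketch of discretization and concept hierarchy generation (the ages, bin edges, and city-to-country mapping are invented):

```python
import pandas as pd

# Discretization: raw ages are replaced by interval labels
ages = pd.Series([12, 25, 37, 48, 63])
intervals = pd.cut(ages, bins=[0, 18, 40, 65],
                   labels=["young", "adult", "senior"])
print(intervals.tolist())  # ['young', 'adult', 'adult', 'senior', 'senior']

# Concept hierarchy generation: map the low-level attribute (city)
# to a higher-level one (country)
city_to_country = {"Vellore": "India", "Chennai": "India", "Paris": "France"}
cities = pd.Series(["Vellore", "Paris", "Chennai"])
print(cities.map(city_to_country).tolist())  # ['India', 'France', 'India']
```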
Automatic Concept Hierarchy Generation
• Some hierarchies can be generated automatically by analysing the number of distinct values per attribute in the data set
• The attribute with the most distinct values is placed at the lowest level of the hierarchy
• There are exceptions, e.g., weekday, month, quarter, year
Example hierarchy, highest level to lowest:
• country: 15 distinct values
• province_or_state: 365 distinct values
• city: 3,567 distinct values
• street: 674,339 distinct values
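A minimal sketch of the idea (the table and column names are invented): ordering attributes by their distinct-value counts recovers the hierarchy levels:

```python
import pandas as pd

# Hypothetical location records
df = pd.DataFrame({
    "country": ["IN", "IN", "IN", "FR"],
    "city":    ["Vellore", "Chennai", "Vellore", "Paris"],
    "street":  ["S1", "S2", "S3", "S4"],
})

# Fewer distinct values -> higher level; most distinct values -> lowest level
order = df.nunique().sort_values().index.tolist()
print(order)  # ['country', 'city', 'street']
```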
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
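The worked example above can be reproduced with a short NumPy sketch (assuming the data is already sorted and splits evenly into equal-frequency bins):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted data into 3 equal-frequency (equi-depth) bins
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value becomes its bin's (rounded) mean
means = np.round(bins.mean(axis=1)).astype(int)
print(np.repeat(means, 4).reshape(3, 4))
# [[ 9  9  9  9]  [23 23 23 23]  [29 29 29 29]]

# Smoothing by bin boundaries: each value snaps to the nearer bin edge
lo, hi = bins[:, :1], bins[:, -1:]
print(np.where(bins - lo <= hi - bins, lo, hi))
# [[ 4  4  4 15]  [21 21 25 25]  [26 26 26 34]]
```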
Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure: the same data discretized by equal-interval-width binning, equal-frequency binning, and K-means clustering; K-means clustering leads to better results]
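A minimal sketch of clustering-based discretization (the values are invented): K-means assigns each value a cluster label, which then serves as its interval, so boundaries fall in the natural gaps rather than at fixed widths:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D values with three natural groups of unequal width
values = np.array([1.0, 1.1, 1.2, 1.4, 5.0, 5.2, 9.8, 10.1]).reshape(-1, 1)

# Each value's interval is its cluster label
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(values)
print(labels)  # the three groups receive three distinct labels
```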
Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g., missing/noisy values, outliers
• Data integration from multiple sources:
  • entity identification problem
  • remove redundancies
  • detect inconsistencies
• Data reduction
  • dimensionality reduction
  • numerosity reduction
  • data compression
• Data transformation
  • normalization
  • concept hierarchy generation