Fall 2024-25
Precision Agriculture
Dr. C. Moganapriya
moganapriya.c@vit.ac.in
Data Preprocessing
Data Preprocessing
• Data Preprocessing: An Overview
• Data Quality
• Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Data Quality: Why Preprocess the Data?
• Measures for data quality: a multidimensional view
  • Accuracy: correct or wrong, accurate or not
  • Completeness: not recorded, unavailable, ...
  • Consistency: some modified but some not, dangling, ...
  • Timeliness: is the data updated in time?
  • Believability: how far can the data be trusted to be correct?
  • Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., Occupation = "" (missing data)
• Noisy: containing noise, errors, or outliers
  • e.g., Salary = "-10" (an error)
• Inconsistent: containing discrepancies in codes or names, e.g.,
  • Age = "42", Birthday = "03/07/2010"
  • was rating "1, 2, 3", now rating "A, B, C"
  • discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
  • Jan. 1 as everyone's birthday?
How to Handle Missing Data?
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with (see the sketch below)
  • a global constant: e.g., "unknown", a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
• Manual: practical only for small data sets
• Automatic: more efficient for larger data sets
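A minimal pandas sketch of the automatic strategies above (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical crop records; yield_t_ha has missing values
df = pd.DataFrame({
    "field": ["A", "A", "B", "B", "B"],
    "yield_t_ha": [3.2, None, 5.1, None, 4.9],
})

# Global constant: mark missing values with a placeholder
df["filled_const"] = df["yield_t_ha"].fillna(-1.0)

# Attribute mean: fill with the overall column mean
df["filled_mean"] = df["yield_t_ha"].fillna(df["yield_t_ha"].mean())

# Class-wise mean: fill with the mean of samples from the same field
df["filled_class_mean"] = df["yield_t_ha"].fillna(
    df.groupby("field")["yield_t_ha"].transform("mean")
)
print(df)
```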
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems that require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?
• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • smooth by fitting the data to regression functions
• Clustering
  • similar items are grouped into clusters; values that fall outside every cluster are detected as outliers and removed (see the sketch below)
• Combined computer and human inspection
  • detect suspicious values and have a human check them (e.g., deal with possible outliers)
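A minimal sketch of clustering-based outlier detection (the readings are invented; DBSCAN is one clustering method that labels points belonging to no dense group as noise):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical soil-moisture readings; 80.0 fits neither group
readings = np.array([22.1, 21.8, 22.4, 35.2, 34.9, 35.5, 80.0]).reshape(-1, 1)

# Points within eps of at least min_samples neighbours form clusters;
# anything left over is labelled -1 (noise/outlier)
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(readings)
print(readings[labels == -1].ravel())  # -> [80.]
```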
Data Integration
• Data integration:
  • combines data from multiple heterogeneous sources into a coherent store
• 2 types
  • Tight coupling
    • data is physically combined into a single location (see the sketch below)
  • Loose coupling
    • only an interface is created; the data is combined and accessed through that interface
    • the data remains in the actual source databases only
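A tight-coupling-style sketch in pandas: records from two hypothetical sources are physically merged into one table (all names and values are invented):

```python
import pandas as pd

# Two heterogeneous sources describing the same fields
soil = pd.DataFrame({"field_id": [1, 2], "ph": [6.4, 7.1]})
weather = pd.DataFrame({"FieldID": [1, 2], "rain_mm": [12.0, 3.5]})

# Entity identification: FieldID and field_id refer to the same key
weather = weather.rename(columns={"FieldID": "field_id"})

# Tight coupling: physically combine both sources into one coherent table
combined = soil.merge(weather, on="field_id")
print(combined)
```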
Data Reduction
• The volume of data is reduced to make analysis easier
Data Reduction
• Dimensionality reduction
  • reduces the number of input variables in the data set, since too many input variables can result in poor performance (see the sketch below)
• Data cube aggregation
  • data is combined to construct a data cube
• Attribute subset selection
  • only highly relevant attributes are kept; the other attributes are discarded, so the data is reduced
• Numerosity reduction
  • only a model of the data is stored instead of the entire data
  • Parametric (e.g., regression models)
  • Non-parametric: histograms, clusters, sampling
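A minimal dimensionality-reduction sketch with PCA on synthetic data: five correlated attributes are projected onto two components with little loss of variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 5 attributes that are essentially 2-dimensional
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

# Project the 5 attributes down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (100, 2)
print(pca.explained_variance_ratio_.sum())  # close to 1.0 for this data
```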
Data Reduction
[Figure: example of data cube aggregation]
Data Compression
• Lossless: the original data can be reconstructed exactly from the compressed data
• Lossy: only an approximation of the original data can be reconstructed
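A small sketch of the lossless/lossy distinction (zlib stands in for a lossless codec; float64-to-float16 casting stands in for a lossy one):

```python
import zlib
import numpy as np

# Hypothetical measurements between 0 and 1
arr = np.linspace(0.0, 1.0, 1000)
data = arr.tobytes()

# Lossless: the compressed bytes round-trip to the original exactly
packed = zlib.compress(data)
assert zlib.decompress(packed) == data

# Lossy: float64 -> float16 cuts storage 4x but only approximates the values
approx = arr.astype(np.float16).astype(np.float64)
print(len(data), len(packed))      # original vs. compressed size
print(np.abs(arr - approx).max())  # small but non-zero reconstruction error
```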
Data Transformation
• Data is transformed into a form appropriate for the mining process
• There are 4 methods of data transformation:
  1. Normalization
  2. Attribute selection
  3. Discretization
  4. Concept hierarchy generation
1. Normalization
• Normalization scales the data values into a specified range
• For example, -1.0 to +1.0, or 0 to 1
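A minimal min-max normalization sketch (one common scaling method; the values are invented):

```python
import numpy as np

# Hypothetical values to rescale into the range [0, 1]
x = np.array([4.0, 8.0, 15.0, 21.0, 34.0])

# Min-max normalization: v' = (v - min) / (max - min)
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # 4 -> 0.0, 34 -> 1.0
```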
Data Transformation
2. Attribute selection
• New attributes are constructed from the given ones
3. Discretization
• Raw values are replaced by interval values
4. Concept hierarchy generation
• Attributes are converted from a low level to a higher level
• Example: city to country
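A short sketch of discretization and concept hierarchy generation (the ages, bin edges, and city-to-country mapping are invented):

```python
import pandas as pd

# Discretization: raw ages are replaced by interval labels
ages = pd.Series([12, 25, 37, 48, 63])
intervals = pd.cut(ages, bins=[0, 18, 40, 65],
                   labels=["young", "adult", "senior"])
print(intervals.tolist())  # ['young', 'adult', 'adult', 'senior', 'senior']

# Concept hierarchy generation: map the low-level attribute (city)
# to a higher-level one (country)
city_to_country = {"Vellore": "India", "Chennai": "India", "Paris": "France"}
cities = pd.Series(["Vellore", "Paris", "Chennai"])
print(cities.map(city_to_country).tolist())  # ['India', 'France', 'India']
```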
Automatic Concept Hierarchy Generation
• Some hierarchies can be generated automatically by analysing the number of distinct values per attribute in the data set
• The attribute with the most distinct values is placed at the lowest level of the hierarchy
• There are exceptions, e.g., weekday, month, quarter, year
Example hierarchy, highest level to lowest:
• country: 15 distinct values
• province_or_state: 365 distinct values
• city: 3,567 distinct values
• street: 674,339 distinct values
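A minimal sketch of the idea (the table and column names are invented): ordering attributes by their distinct-value counts recovers the hierarchy levels:

```python
import pandas as pd

# Hypothetical location records
df = pd.DataFrame({
    "country": ["IN", "IN", "IN", "FR"],
    "city":    ["Vellore", "Chennai", "Vellore", "Paris"],
    "street":  ["S1", "S2", "S3", "S4"],
})

# Fewer distinct values -> higher level; most distinct values -> lowest level
order = df.nunique().sort_values().index.tolist()
print(order)  # ['country', 'city', 'street']
```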
Binning Methods for Data Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
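The worked example above can be reproduced with a short NumPy sketch (assuming the data is already sorted and splits evenly into equal-frequency bins):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted data into 3 equal-frequency (equi-depth) bins
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value becomes its bin's (rounded) mean
means = np.round(bins.mean(axis=1)).astype(int)
print(np.repeat(means, 4).reshape(3, 4))
# [[ 9  9  9  9]  [23 23 23 23]  [29 29 29 29]]

# Smoothing by bin boundaries: each value snaps to the nearer bin edge
lo, hi = bins[:, :1], bins[:, -1:]
print(np.where(bins - lo <= hi - bins, lo, hi))
# [[ 4  4  4 15]  [21 21 25 25]  [26 26 26 34]]
```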
Discretization Without Using Class Labels (Binning vs. Clustering)
[Figure: the same data discretized by equal-interval-width binning, equal-frequency binning, and K-means clustering; K-means clustering leads to better results]
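A minimal sketch of clustering-based discretization (the values are invented): K-means assigns each value a cluster label, which then serves as its interval, so boundaries fall in the natural gaps rather than at fixed widths:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D values with three natural groups of unequal width
values = np.array([1.0, 1.1, 1.2, 1.4, 5.0, 5.2, 9.8, 10.1]).reshape(-1, 1)

# Each value's interval is its cluster label
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(values)
print(labels)  # the three groups receive three distinct labels
```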
Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g., missing/noisy values, outliers
• Data integration from multiple sources:
  • entity identification problem
  • remove redundancies
  • detect inconsistencies
• Data reduction
  • dimensionality reduction
  • numerosity reduction
  • data compression
• Data transformation
  • normalization
  • concept hierarchy generation