Sanjivani Rural Education Society's
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC "A" Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No.8275032712
Email- shivarkarsandipcomp@sanjivani.org.in
Subject- Supervised Modeling and AI Technologies (CO9401)
Unit II: Supervised Learning Decision Trees
Content
• Decision trees, designing/building of decision trees, greedy algorithm,
decision tree algorithm selection, constraints of the decision tree
algorithm, use of a decision tree as a classifier as well as a regressor
• Attribute selection (Entropy, Information gain, Gini index)
Decision Tree Induction: Training Dataset
• Training data set: Buys_computer
• The data set follows an example of Quinlan's ID3 (Playing Tennis)
• Resulting tree (reconstructed from the slide figure):

  age?
  ├── <=30 → student?
  │         ├── no → no
  │         └── yes → yes
  ├── 31..40 → yes
  └── >40 → credit rating?
            ├── excellent → no
            └── fair → yes
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm); a minimal code sketch follows this list
  • The tree is constructed in a top-down, recursive, divide-and-conquer manner
  • At the start, all the training examples are at the root
  • Attributes are categorical (continuous-valued attributes are discretized in advance)
  • Examples are partitioned recursively based on selected attributes
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning → majority voting is employed for classifying the leaf
  • There are no samples left
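To make the greedy procedure concrete, here is a minimal ID3-style sketch in Python. It is an illustration under assumptions, not the textbook's exact algorithm: categorical attributes only, no pruning, and the rows and attribute names in the usage example are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum p_i log2 p_i over the class distribution.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(A) = Info(D) - Info_A(D), where Info_A is the weighted
    # entropy of the partitions induced by attribute A.
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    info_a = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - info_a

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                        # all samples in one class
        return labels[0]
    if not attrs:                                    # no attributes left:
        return Counter(labels).most_common(1)[0][0]  # majority vote
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in {row[best] for row in rows}:        # recursive partitioning
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        node[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       [a for a in attrs if a != best])
    return node

# Hypothetical four-tuple dataset in the spirit of buys_computer:
rows = [
    {"age": "<=30",   "student": "no"},
    {"age": "<=30",   "student": "yes"},
    {"age": "31..40", "student": "no"},
    {"age": ">40",    "student": "no"},
]
labels = ["no", "yes", "yes", "no"]
print(build_tree(rows, labels, ["age", "student"]))
```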
Brief Review of Entropy
• For m = 2 classes with probabilities p and 1 − p:
  $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$
• [Figure: binary entropy curve; H reaches its maximum of 1 bit at p = 0.5 and drops to 0 at p = 0 or p = 1]
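A quick numeric check of the m = 2 case (a minimal sketch; the helper name is ours):

```python
import math

def binary_entropy(p):
    # H(p) = -p log2 p - (1 - p) log2 (1 - p); defined as 0 for pure nodes.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 -> maximum uncertainty
print(binary_entropy(0.9))  # ~0.469 -> mostly one class
```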
Attribute Selection Measure: Information Gain (ID3)
• Select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|
• Expected information (entropy) needed to classify a tuple in D:
  $\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
• Information needed (after using A to split D into v partitions) to classify D:
  $\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \mathrm{Info}(D_j)$
• Information gained by branching on attribute A:
  $\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$
Decision Tree Induction: Training Dataset Example 1
Attribute Selection: Information Gain
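As a worked reference for this slide, the standard buys_computer computation from Han et al. (9 "yes" and 5 "no" tuples; age partitions them 5/4/5) runs:

$$
\begin{aligned}
\mathrm{Info}(D) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940 \\
\mathrm{Info}_{age}(D) &= \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) = 0.694 \\
\mathrm{Gain}(age) &= 0.940 - 0.694 = 0.246
\end{aligned}
$$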
Decision Tree Induction: Training Dataset Example 2
Computing Information Gain for Continuous-Valued Attributes
• Let attribute A be a continuous-valued attribute
• The best split point for A must be determined (a code sketch follows this list)
  • Sort the values of A in increasing order
  • Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    • (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}
  • The point with the minimum expected information requirement for A is selected as the split point for A
• Split:
  • D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
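A minimal sketch of this midpoint search (the sample values are hypothetical):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # Sort by value, then score the midpoint of every adjacent pair by
    # the expected information of the two-way split it induces.
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best_info, best_mid = float("inf"), None
    for i in range(len(xs) - 1):
        if xs[i] == xs[i + 1]:
            continue  # equal adjacent values yield no usable midpoint
        mid = (xs[i] + xs[i + 1]) / 2
        left, right = ys[: i + 1], ys[i + 1 :]   # D1: A <= mid, D2: A > mid
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if info < best_info:
            best_info, best_mid = info, mid
    return best_mid

ages = [22, 25, 28, 33, 38, 45, 51]
buys = ["yes", "yes", "no", "yes", "no", "no", "no"]
print(best_split_point(ages, buys))  # midpoint with minimum expected information
```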
Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  $\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right)$
• GainRatio(A) = Gain(A)/SplitInfo_A(D)
• Ex.: gain_ratio(income) = 0.029/1.557 = 0.019 (a small numeric check follows below)
• The attribute with the maximum gain ratio is selected as the splitting attribute
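A small numeric check of the gain-ratio formula; the 4/6/4 partition sizes and Gain(income) = 0.029 follow the standard 14-tuple example from Han et al.:

```python
import math

def split_info(part_sizes, total):
    # SplitInfo_A(D) = -sum |Dj|/|D| * log2(|Dj|/|D|)
    return -sum(s / total * math.log2(s / total) for s in part_sizes if s)

def gain_ratio(gain, part_sizes, total):
    si = split_info(part_sizes, total)
    return gain / si if si else 0.0

print(round(split_info([4, 6, 4], 14), 3))         # 1.557
print(round(gain_ratio(0.029, [4, 6, 4], 14), 3))  # 0.019
```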
Attribute Selection (C4.5): Example 1
Training data from an employee database:

Department   Age     Salary    Count   Status
sales        31–35   46K–50K   30      senior
sales        26–30   26K–30K   40      junior
sales        31–35   31K–35K   40      junior
systems      21–25   46K–50K   20      junior
systems      31–35   66K–70K   5       senior
systems      26–30   46K–50K   3       junior
systems      41–45   66K–70K   3       senior
marketing    36–40   46K–50K   10      senior
marketing    31–35   41K–45K   4       junior
secretary    46–50   36K–40K   4       senior
secretary    26–30   26K–30K   6       junior
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
  $\mathrm{gini}(D) = 1 - \sum_{j=1}^{n} p_j^2$
  where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
  $\mathrm{gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{gini}(D_2)$
• Reduction in impurity:
  $\Delta\mathrm{gini}(A) = \mathrm{gini}(D) - \mathrm{gini}_A(D)$
• The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute)
Gini Index: Example 1
Computation of Gini Index
• Ex.: D has 9 tuples in buys_computer = "yes" and 5 in "no":
  $\mathrm{gini}(D) = 1 - \left(\tfrac{9}{14}\right)^2 - \left(\tfrac{5}{14}\right)^2 = 0.459$
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
  $\mathrm{gini}_{income \in \{low,\,medium\}}(D) = \tfrac{10}{14}\,\mathrm{Gini}(D_1) + \tfrac{4}{14}\,\mathrm{Gini}(D_2) = 0.443$
• Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index (a code sketch follows below)
• All attributes are assumed continuous-valued
  • Other tools, e.g., clustering, may be needed to get the possible split values
  • Can be modified for categorical attributes
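A minimal sketch of these Gini computations; the per-partition class counts (7 "yes"/3 "no" in D1 and 2 "yes"/2 "no" in D2) follow the textbook buys_computer data:

```python
def gini(counts):
    # gini(D) = 1 - sum p_j^2 over the class counts.
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    # Weighted Gini of a binary split: |D1|/|D| gini(D1) + |D2|/|D| gini(D2).
    n = sum(d1_counts) + sum(d2_counts)
    return (sum(d1_counts) / n) * gini(d1_counts) \
         + (sum(d2_counts) / n) * gini(d2_counts)

print(round(gini([9, 5]), 3))                # 0.459
print(round(gini_split([7, 3], [2, 2]), 3))  # 0.443 -> beats 0.458 and 0.450
```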
Comparing Attribute Selection Measures
• The three measures, in general, return good results, but:
  • Information gain:
    • biased towards multivalued attributes
  • Gain ratio:
    • tends to prefer unbalanced splits in which one partition is much smaller than the others
  • Gini index:
    • biased towards multivalued attributes
    • has difficulty when the number of classes is large
    • tends to favor tests that result in equal-sized partitions with purity in both partitions
Other Attribute Selection Measures
• CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence
• C-SEP: performs better than information gain and the Gini index in certain cases
• G-statistic: has a close approximation to the χ² distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  • The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree
• Multivariate splits (partition based on multiple variable combinations)
  • CART: finds multivariate splits based on a linear combination of attributes
• Which attribute selection measure is the best?
  • Most give good results; none is significantly superior to the others
Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
  • The model tries to accommodate all data points
  • Too many branches, some of which may reflect anomalies due to noise or outliers
  • Poor accuracy on unseen samples
• One way to avoid overfitting is to use a linear algorithm when the data are linear, or to constrain parameters such as the maximal depth when using decision trees
• Two approaches to avoid overfitting (a pruning sketch follows this list)
  • Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • It is difficult to choose an appropriate threshold
  • Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the best pruned tree
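One concrete way to realize postpruning (an assumption on tooling, not what the slide prescribes) is scikit-learn's cost-complexity pruning for CART trees: grow a full tree, enumerate the pruning sequence, and keep the candidate that scores best on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Enumerate the effective alphas of the pruning sequence for a full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = clf.score(X_val, y_val)  # validate on data unseen during training
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)
```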
Overfitting and Tree Pruning
• Underfitting: An induced tree may underfit the training data
  • The model accommodates very few data points, e.g., when only 10% of the dataset is used for training and 90% for testing
  • It has very low accuracy
  • Underfit models are inaccurate, especially when applied to new, unseen examples
• Techniques to reduce underfitting
  • Increase model complexity
  • Increase the number of features by performing feature engineering
  • Remove noise from the data
  • Increase the number of epochs, or the duration of training, to get better results
Overfitting and Underfitting
Reasons for overfitting:
1. High variance and low bias
2. The model is too complex
3. The training data set is too small for the model
Reasons for underfitting:
1. The model is not capable of representing the complexities in the data
2. The size of the training dataset used is not enough
3. Features are not scaled
Enhancements to Basic Decision Tree Induction
• Allow for continuous-valued attributes
  • Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
• Handle missing attribute values (a sketch of both strategies follows this list)
  • Assign the most common value of the attribute
  • Assign a probability to each of the possible values
• Attribute construction
  • Create new attributes based on existing ones that are sparsely represented
  • This reduces fragmentation, repetition, and replication
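A small sketch of the two missing-value strategies named above, using pandas on a hypothetical "income" column:

```python
import pandas as pd

s = pd.Series(["high", "low", None, "medium", "low", None], name="income")

# Strategy 1: assign the most common value of the attribute.
filled = s.fillna(s.mode()[0])

# Strategy 2: assign a probability to each possible value and sample
# replacements from that distribution.
probs = s.value_counts(normalize=True)
sampled = s.apply(lambda v: v if pd.notna(v)
                  else probs.sample(weights=probs).index[0])

print(filled.tolist())
print(sampled.tolist())
```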
References
• Han, Jiawei; Kamber, Micheline; Pei, Jian. Data Mining: Concepts and Techniques. Elsevier, ISBN: 9780123814791, 9780123814807.
• https://onlinecourses.nptel.ac.in/noc24_cs22
• https://medium.com/analytics-vidhya/type-of-distances-used-in-machine-learning-algorithm-c873467140de
• https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/