Supervised Learning Decision Trees Review of Entropy
1. Sanjivani Rural Education Society's
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC "A" Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No.: 8275032712
Email: shivarkarsandipcomp@sanjivani.org.in
Subject: Supervised Modeling and AI Technologies (CO9401)
Unit II: Supervised Learning Decision Trees
2. Content
Decision trees, designing/building of decision trees, greedy algorithm, decision tree algorithm selection, constraints of the decision tree algorithm, use of a decision tree as a classifier as well as a regressor, attribute selection (entropy, information gain, Gini index).
4. Decision Tree Induction: Training Dataset
Training data set: Buys_computer. The data set follows an example of Quinlan's ID3 (Playing Tennis).
Resulting tree:
    age?
      <=30   -> student?        (no -> no, yes -> yes)
      31..40 -> yes
      >40    -> credit rating?  (excellent -> no, fair -> yes)
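For illustration only (not code from the slides), the resulting tree can be represented as a nested dictionary and walked to classify a new tuple; the names tree and classify are assumptions:

# Minimal sketch: the buys_computer tree above as a nested dict, plus a classifier.
tree = {
    "age": {
        "<=30":   {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def classify(node, sample):
    # Walk the tree until a leaf (a class label string) is reached.
    while isinstance(node, dict):
        attribute = next(iter(node))              # attribute tested at this node
        node = node[attribute][sample[attribute]]
    return node

print(classify(tree, {"age": "<=30", "student": "yes"}))   # -> yes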
5. Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
Attributes are categorical (if continuous-valued, they are discretized in advance).
Examples are partitioned recursively based on selected attributes.
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
There are no samples left.
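A compact Python sketch of this greedy, top-down procedure, using information gain as the selection measure and majority voting at leaves. The function names (build_tree, entropy) and the nested-dict tree format are illustrative assumptions, not from the slides:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def build_tree(rows, target, attributes):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:                       # all samples belong to the same class
        return labels[0]
    if not attributes:                              # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]

    def remaining_info(attr):                       # expected info after splitting on attr
        values = Counter(row[attr] for row in rows)
        return sum(count / len(rows) *
                   entropy([r[target] for r in rows if r[attr] == value])
                   for value, count in values.items())

    best = min(attributes, key=remaining_info)      # smallest remaining info = highest gain
    node = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        node[best][value] = build_tree(subset, target, rest)
    return node

# Example call (rows are dicts of categorical attributes plus the class label):
sample = [{"age": "<=30", "student": "no", "buys_computer": "no"},
          {"age": "<=30", "student": "yes", "buys_computer": "yes"},
          {"age": "31..40", "student": "no", "buys_computer": "yes"}]
print(build_tree(sample, "buys_computer", ["age", "student"]))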
8. Attribute Selection Measure: Information Gain (ID3)
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)
Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
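As a hedged illustration of these three quantities (not part of the original slides), the following sketch computes Info(D) and Gain(age) for the buys_computer data; the per-branch class counts for age are taken from the Han and Kamber textbook example, so treat them as an assumption:

import math

def info(counts):
    # Info(D) = -sum p_i log2 p_i, with counts = class frequencies in D.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_after_split(partitions):
    # Info_A(D): weighted entropy of the partitions D_1..D_v induced by attribute A.
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

# buys_computer: 9 "yes" and 5 "no" tuples overall.
info_d = info([9, 5])                                             # ~0.940 bits
# Splitting on age gives partitions with class counts (yes, no):
# <=30 -> (2, 3), 31..40 -> (4, 0), >40 -> (3, 2)
gain_age = info_d - info_after_split([[2, 3], [4, 0], [3, 2]])    # ~0.246 bits
print(round(info_d, 3), round(gain_age, 3))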
12. Computing Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute.
We must determine the best split point for A:
Sort the values of A in increasing order.
Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}.
The point with the minimum expected information requirement for A is selected as the split-point for A.
Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
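A hedged sketch of this split-point search; the function names are illustrative and the expected-information computation mirrors Info_A(D) above for the two-way split:

import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    # Return the midpoint split that minimizes the expected information requirement.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best, best_info = None, float("inf")
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        midpoint = (a_i + a_next) / 2                       # (a_i + a_{i+1}) / 2
        left  = [lab for v, lab in pairs if v <= midpoint]  # A <= split-point
        right = [lab for v, lab in pairs if v >  midpoint]  # A >  split-point
        expected_info = (len(left) / n) * _entropy(left) + (len(right) / n) * _entropy(right)
        if expected_info < best_info:
            best, best_info = midpoint, expected_info
    return best

print(best_split_point([25, 32, 41, 47, 29], ["no", "yes", "yes", "no", "yes"]))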
13. Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
    SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
    GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex.: gain_ratio(income) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
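A small sketch of SplitInfo and gain ratio. Gain(income) = 0.029 is from the slide; the partition sizes 4, 6, and 4 for income are taken from the textbook's buys_computer data and should be treated as an assumption:

import math

def split_info(partition_sizes):
    # SplitInfo_A(D): entropy of the partition sizes themselves.
    total = sum(partition_sizes)
    return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

print(round(split_info([4, 6, 4]), 3))          # ~1.557
print(round(gain_ratio(0.029, [4, 6, 4]), 3))   # ~0.019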
14. Attribute Selection (C4.5): Example 1
Department | Age   | Salary | Count | Status
sales      | 31-35 | 46-50  | 30    | senior
sales      | 26-30 | 26-30  | 40    | junior
sales      | 31-35 | 31-35  | 40    | junior
systems    | 21-25 | 46-50  | 20    | junior
systems    | 21-31 | 66-70  | 5     | senior
systems    | 26-30 | 46-50  | 3     | junior
systems    | 41-45 | 66-70  | 3     | senior
marketing  | 36-40 | 46-50  | 10    | senior
marketing  | 31-35 | 41-45  | 4     | junior
secretary  | 46-50 | 36-40  | 4     | senior
secretary  | 26-30 | 26-30  | 6     | junior
Training data from an employee database (Age and Salary are given as ranges; Count gives the number of tuples with those values).
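One practical point the table raises is how to use the Count column: each row stands for Count identical tuples, so class frequencies must be weighted by those counts. A minimal sketch (the row encoding and helper names are illustrative, not from the slides) that computes the information gain of Department for predicting Status:

import math
from collections import defaultdict

rows = [
    ("sales", "senior", 30), ("sales", "junior", 40), ("sales", "junior", 40),
    ("systems", "junior", 20), ("systems", "senior", 5), ("systems", "junior", 3),
    ("systems", "senior", 3), ("marketing", "senior", 10), ("marketing", "junior", 4),
    ("secretary", "senior", 4), ("secretary", "junior", 6),
]

def weighted_entropy(class_weights):
    total = sum(class_weights.values())
    return -sum((w / total) * math.log2(w / total) for w in class_weights.values() if w > 0)

def gain_for(rows, attribute_index=0, class_index=1, count_index=2):
    overall = defaultdict(float)
    per_value = defaultdict(lambda: defaultdict(float))
    for row in rows:
        value, label, count = row[attribute_index], row[class_index], row[count_index]
        overall[label] += count            # class totals, weighted by Count
        per_value[value][label] += count   # class totals per attribute value
    total = sum(overall.values())
    info_d = weighted_entropy(overall)
    info_a = sum(sum(cw.values()) / total * weighted_entropy(cw) for cw in per_value.values())
    return info_d - info_a

print(round(gain_for(rows), 3))   # information gain of Department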
15. Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
Reduction in impurity:
    \Delta gini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (all possible splitting points for each attribute need to be enumerated).
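A minimal sketch of the two quantities above; the function names are illustrative:

def gini(class_counts):
    # gini(D) = 1 - sum_j p_j^2, with class_counts the class frequencies in D.
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(d1_counts, d2_counts):
    # gini_A(D) for a binary split of D into D1 and D2.
    n1, n2 = sum(d1_counts), sum(d2_counts)
    total = n1 + n2
    return (n1 / total) * gini(d1_counts) + (n2 / total) * gini(d2_counts)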
17. Computation of Gini Index
Ex.: D has 9 tuples with buys_computer = yes and 5 with no:
    gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
    gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2)
Gini{low, high} is 0.458 and Gini{medium, high} is 0.450; thus, split on {low, medium} (and {high}) since it has the lowest Gini index.
All attributes are assumed continuous-valued.
May need other tools, e.g., clustering, to get the possible split values.
Can be modified for categorical attributes.
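Using the gini and gini_split sketches from the previous section, the worked numbers can be reproduced; note that the per-partition class counts (7 yes / 3 no for {low, medium}; 2 yes / 2 no for {high}) come from the textbook's buys_computer data, not from this slide, so treat them as assumptions:

print(round(gini([9, 5]), 3))                 # gini(D) ~ 0.459
print(round(gini_split([7, 3], [2, 2]), 3))   # gini_{income in {low, medium}}(D)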
18. Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
Information gain: biased towards multivalued attributes.
Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.
19. Other Attribute Selection Measures
CHAID: a popular decision tree algorithm; its measure is based on the χ² test for independence.
C-SEP: performs better than information gain and the Gini index in certain cases.
G-statistic: has a close approximation to the χ² distribution.
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree.
Multivariate splits (partition based on multiple variable combinations): CART finds multivariate splits based on a linear combination of attributes.
Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.
20. Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data:
The model tries to accommodate all data points.
Too many branches; some may reflect anomalies due to noise or outliers.
Poor accuracy for unseen samples.
A solution to avoid overfitting is using a linear algorithm if we have linear data, or using parameters such as the maximal depth if we are using decision trees.
Two approaches to avoid overfitting:
Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
Postpruning: Remove branches from a fully grown tree to get a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the best pruned tree.
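For reference, both ideas map onto scikit-learn's DecisionTreeClassifier. This is an illustrative sketch assuming scikit-learn is installed; the dataset and parameter values are arbitrary, and this is not the procedure prescribed by the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via max_depth / min_samples_split thresholds.
prepruned = DecisionTreeClassifier(max_depth=3, min_samples_split=10).fit(X_train, y_train)

# Postpruning: grow fully, then prune back with cost-complexity pruning (ccp_alpha);
# a held-out set decides which pruned tree to keep.
postpruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(prepruned.score(X_valid, y_valid), postpruned.score(X_valid, y_valid))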
21. Overfitting and Tree Pruning
Underfitting: An induced tree may also underfit the training data:
The model tries to accommodate very few data points, e.g. 10% of the dataset for training and 90% for testing.
It has very low accuracy.
Underfit models are inaccurate, especially when applied to new, unseen examples.
Techniques to reduce underfitting:
Increase model complexity.
Increase the number of features by performing feature engineering.
Remove noise from the data.
Increase the number of epochs or increase the duration of training to get better results.
22. Overfitting and Underfitting
Reasons for overfitting:
1. High variance and low bias.
2. The model is too complex.
3. The size of the training data is too small.
Reasons for underfitting:
1. The model is not capable of representing the complexities in the data.
2. The size of the training dataset used is not enough.
3. Features are not scaled.
24. Enhancements to Basic Decision Tree Induction
Allow for continuous-valued attributes:
Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.
Handle missing attribute values:
Assign the most common value of the attribute, or assign a probability to each of the possible values.
Attribute construction:
Create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.
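A brief, illustrative pandas sketch of two of these enhancements; the column names and values are made up, and filling a numeric column with its mode is only for demonstration:

import pandas as pd

df = pd.DataFrame({"age": [23, 35, 47, None, 31],
                   "buys_computer": ["no", "yes", "yes", "no", "yes"]})

# Handle missing attribute values: assign the most common value of the attribute.
df["age"] = df["age"].fillna(df["age"].mode().iloc[0])

# Continuous-valued attribute: partition into a discrete set of intervals.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 40, 120], labels=["<=30", "31..40", ">40"])
print(df)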
25. Reference
Han, Jiawei, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, Elsevier, ISBN: 9780123814791, 9780123814807.
https://onlinecourses.nptel.ac.in/noc24_cs22
https://medium.com/analytics-vidhya/type-of-distances-used-in-machine-learning-algorithm-c873467140de
https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/