1. Sanjivani Rural Education Society's
Sanjivani College of Engineering, Kopargaon-423 603
(An Autonomous Institute, Affiliated to Savitribai Phule Pune University, Pune)
NAAC 'A' Grade Accredited, ISO 9001:2015 Certified
Department of Computer Engineering
(NBA Accredited)
Prof. S. A. Shivarkar
Assistant Professor
Contact No. 8275032712
Email: shivarkarsandipcomp@sanjivani.org.in
Subject: Supervised Modeling and AI Technologies (CO9401)
Unit I: Supervised Learning - Naïve Bayes and K-NN
2. Content
Bayesian classifier, Naïve Bayes classifier cases, constraints of Naïve Bayes, advantages of Naïve Bayes, comparison of Naïve Bayes with other classifiers.
K-nearest neighbor classifier, K-nearest neighbor classifier selection criteria, constraints of K-nearest neighbor, advantages and disadvantages of K-nearest neighbor algorithms, controlling complexity of K-NN.
3. Supervised vs. Unsupervised Learning
Supervised learning (classification), e.g., predicting Yes or No.
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.
Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
4. Prediction Problems: Classification vs. Numeric Prediction
Classification
Predicts categorical class labels (discrete or nominal).
Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data.
Numeric Prediction
Models continuous-valued functions, i.e., predicts unknown or missing values.
Typical applications
Credit/loan approval: loan approved, yes or no
Medical diagnosis: whether a tumor is cancerous or benign
Fraud detection: whether a transaction is fraudulent
Web page categorization: which category a page belongs to
5. Classification: A Two-Step Process
Model construction: describing a set of predetermined classes.
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: classifying future or unknown objects.
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set is independent of the training set (otherwise overfitting results).
If the accuracy is acceptable, use the model to classify new data.
Note: if the test set is used to select models, it is called a validation (test) set.
8. Issues: Data Preparation
Data cleaning: preprocess data in order to reduce noise and handle missing values.
Relevance analysis (feature selection): remove irrelevant or redundant attributes.
Data transformation: generalize and/or normalize data.
9. Issues: Evaluating Classification Methods
Accuracy
Classifier accuracy: predicting class labels.
Predictor accuracy: guessing values of predicted attributes.
Speed
Time to construct the model (training time).
Time to use the model (classification/prediction time).
Robustness: handling noise and missing values.
Scalability: efficiency in disk-resident databases.
Interpretability: understanding and insight provided by the model.
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules.
10. Issues: Evaluating Classification Methods: Accuracy
Accuracy simply measures how often the classifier predicts correctly.
We can define accuracy as the ratio of the number of correct predictions to the total number of predictions.
For binary classification (only two class labels): Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP and TN are the counts of true positives and true negatives.
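As a quick illustration, a minimal Python sketch of this definition (the function name and sample labels are illustrative, not from the slides):

    def accuracy(y_true, y_pred):
        """Ratio of the number of correct predictions to the total number of predictions."""
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        return correct / len(y_true)

    # Example: 4 of 5 predictions match the true labels.
    print(accuracy([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # 0.8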
12. Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes' theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
13. Bayes' Theorem: Basics
Total probability theorem: P(B) = Σ(i=1..M) P(B|Ai) P(Ai)
Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
Let X be a data sample ("evidence"): its class label is unknown.
Let H be a hypothesis that X belongs to class C.
Classification is to determine P(H|X) (i.e., the posteriori probability): the probability that the hypothesis holds given the observed data sample X.
P(H) (prior probability): the initial probability. E.g., X will buy a computer, regardless of age, income, etc.
P(X): the probability that the sample data is observed.
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds. E.g., given that X will buy a computer, the probability that X is 31..40 with medium income.
14. Prediction Based on Bayes' Theorem
Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be viewed as: posteriori = likelihood x prior / evidence.
Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
Practical difficulty: it requires initial knowledge of many probabilities, involving significant computational cost.
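A small numeric sketch of this rule (all probability values below are made up for illustration):

    # posteriori = likelihood x prior / evidence, with made-up numbers.
    prior_h    = 0.3                                               # P(H)
    likelihood = 0.8                                               # P(X|H)
    p_x_not_h  = 0.2                                               # P(X|not H)
    evidence   = likelihood * prior_h + p_x_not_h * (1 - prior_h)  # P(X), by total probability
    posterior  = likelihood * prior_h / evidence                   # P(H|X)
    print(round(posterior, 3))  # 0.632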
15. Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn).
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X).
This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
16. Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = Π(k=1..n) P(xk|Ci) = P(x1|Ci) * P(x2|Ci) * … * P(xn|Ci)
This greatly reduces the computation cost: only the class distribution needs to be counted.
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x-μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi).
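As a quick check of the two formulas above, a minimal Python sketch (function names and the sample numbers are ours, not the slides'):

    import math

    def gaussian(x, mu, sigma):
        """Gaussian density g(x, mu, sigma) for a continuous attribute."""
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    def naive_likelihood(per_attribute_probs):
        """P(X|Ci) under conditional independence: the product of the P(xk|Ci)."""
        result = 1.0
        for p in per_attribute_probs:
            result *= p
        return result

    # Example: three attributes with P(x1|Ci)=0.22, P(x2|Ci)=0.44, P(x3|Ci)=0.67.
    print(naive_likelihood([0.22, 0.44, 0.67]))  # ~0.0649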
17. Naïve Bayes Classifier Example 1
Class:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'
Data to be classified:
X = (Age <= 30, Income = medium, Student = yes, Credit_rating = fair)
18. Naïve Bayes Classifier Example 1 Solution
Prior probability:
P(Buys Computer = Yes) = 9/14 = 0.642
P(Buys Computer = No) = 5/14 = 0.357
Posterior/conditional probabilities:
Age: P(<=30|Y)=2/9=0.22, P(<=30|N)=3/5=0.6; P(31…40|Y)=4/9=0.44, P(31…40|N)=0/5=0; P(>40|Y)=3/9=0.33, P(>40|N)=2/5=0.4
Income: P(High|Y)=2/9=0.22, P(High|N)=2/5=0.4; P(Medium|Y)=4/9=0.44, P(Medium|N)=2/5=0.4; P(Low|Y)=3/9=0.33, P(Low|N)=1/5=0.2
Student: P(Yes|Y)=6/9=0.67, P(Yes|N)=3/5=0.6; P(No|Y)=3/9=0.33, P(No|N)=2/5=0.4
Credit Rating: P(Fair|Y)=6/9=0.67, P(Fair|N)=2/5=0.4; P(Excellent|Y)=3/9=0.33, P(Excellent|N)=3/5=0.6
19. Naïve Bayes Classifier Example 1 Solution (continued)
Using the prior and conditional probabilities above, for X = (Age <= 30, Income = medium, Student = yes, Credit_rating = fair):
P(X|Yes) P(Yes) = 0.22 * 0.44 * 0.67 * 0.67 * 0.642 = 0.0279
P(X|No) P(No) = 0.6 * 0.4 * 0.6 * 0.4 * 0.357 = 0.0206
As 0.0279 > 0.0206, Buys Computer = Yes.
20. Naïve Bayes Classifier Example 1 Solution
Data to be classified: Age = 31…40, Income = High, Student = No, Credit Rating = Excellent; Buys Computer?
Using the same conditional probability tables:
P(X|Yes) P(Yes) = 0.44 * 0.22 * 0.33 * 0.33 * 0.642 = 0.0067
P(X|No) P(No) = 0 * 0.4 * 0.4 * 0.6 * 0.357 = 0
As P(Yes) > P(No), Buys Computer = Yes.
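A short Python sketch reproducing Example 1 end to end, with the priors and conditional probabilities copied from the tables above (the code layout and key names are ours):

    priors = {"yes": 9/14, "no": 5/14}
    cond = {  # P(attribute value | class), from the conditional probability tables
        "yes": {"age<=30": 2/9, "age31..40": 4/9, "income=high": 2/9, "income=medium": 4/9,
                "student=yes": 6/9, "student=no": 3/9, "credit=fair": 6/9, "credit=excellent": 3/9},
        "no":  {"age<=30": 3/5, "age31..40": 0/5, "income=high": 2/5, "income=medium": 2/5,
                "student=yes": 3/5, "student=no": 2/5, "credit=fair": 2/5, "credit=excellent": 3/5},
    }

    def score(x, c):
        """Unnormalized posterior P(X|C) * P(C) under the naive independence assumption."""
        p = priors[c]
        for value in x:
            p *= cond[c][value]
        return p

    x = ["age<=30", "income=medium", "student=yes", "credit=fair"]
    print({c: round(score(x, c), 4) for c in priors})  # yes ~0.0282, no ~0.0206 (matches the rounded hand computation)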
21. Naïve Bayes Classifier Example 2
Class:
C1: Playing Tennis = 'yes'
C2: Playing Tennis = 'no'
Data to be classified: Outlook = Rainy, Temp = Hot, Humidity = High, Windy = Strong, Play = ?
Day | Outlook | Temp | Humidity | Windy | Play
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rainy | Mild | High | Weak | Yes
D5 | Rainy | Cool | Normal | Weak | Yes
D6 | Rainy | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rainy | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rainy | Mild | High | Strong | No
22. Naïve Bayes Classifier Example 2 Solution
Prior probability:
P(Y) = 9/14 = 0.6428
P(N) = 5/14 = 0.3571
Posterior/conditional probabilities:
Outlook: P(Sunny|Y)=2/9=0.22, P(Sunny|N)=3/5=0.6; P(Overcast|Y)=4/9=0.44, P(Overcast|N)=0/5=0; P(Rainy|Y)=3/9=0.33, P(Rainy|N)=2/5=0.4
Temp: P(Hot|Y)=2/9=0.22, P(Hot|N)=2/5=0.4; P(Mild|Y)=4/9=0.44, P(Mild|N)=2/5=0.4; P(Cool|Y)=3/9=0.33, P(Cool|N)=1/5=0.2
Humidity: P(High|Y)=3/9=0.33, P(High|N)=4/5=0.8; P(Normal|Y)=6/9=0.66, P(Normal|N)=1/5=0.2
Windy: P(Weak|Y)=6/9=0.67, P(Weak|N)=2/5=0.4; P(Strong|Y)=3/9=0.33, P(Strong|N)=3/5=0.6
For Outlook = Rainy, Temp = Hot, Humidity = High, Windy = Strong:
P(X|Y) P(Y) = 0.333 * 0.222 * 0.33 * 0.33 * 0.6428 = 0.0052
P(X|N) P(N) = 0.4 * 0.4 * 0.8 * 0.6 * 0.3571 = 0.0274
As P(Y) < P(N), the prediction is WILL NOT PLAY.
23. Naïve Bayes Classifier Example 3
Given the training set below for a classification problem with two classes, "fraud" and "normal". There are two attributes, A1 and A2, taking values 0 or 1. Into which class does the Bayes classifier classify the instance (A1=1, A2=1)?
A1 | A2 | Class
1 | 0 | fraud
1 | 1 | fraud
1 | 1 | fraud
1 | 0 | normal
1 | 1 | fraud
0 | 0 | normal
0 | 0 | normal
0 | 0 | normal
1 | 1 | normal
1 | 0 | normal
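The exercise can be answered with the same counting recipe; a short sketch (the result is our own computation, not given on the slide):

    # Naive Bayes counts over the 10-row training set above.
    rows = [(1, 0, "fraud"), (1, 1, "fraud"), (1, 1, "fraud"), (1, 0, "normal"), (1, 1, "fraud"),
            (0, 0, "normal"), (0, 0, "normal"), (0, 0, "normal"), (1, 1, "normal"), (1, 0, "normal")]

    def score(a1, a2, cls):
        """Unnormalized posterior P(A1|C) * P(A2|C) * P(C)."""
        in_class = [r for r in rows if r[2] == cls]
        p_a1 = sum(1 for r in in_class if r[0] == a1) / len(in_class)
        p_a2 = sum(1 for r in in_class if r[1] == a2) / len(in_class)
        return p_a1 * p_a2 * len(in_class) / len(rows)

    print(score(1, 1, "fraud"), score(1, 1, "normal"))  # 0.3 vs 0.05 -> classify as fraud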
24. Benefits of Naïve Bayes Classifier
It is simple and easy to implement.
It doesn't require much training data.
It handles both continuous and discrete data.
It is highly scalable with the number of predictors and data points.
It is fast and can be used to make real-time predictions.
It is not sensitive to irrelevant features.
25. Limitations of Naïve Bayes Classifier
Naive Bayes assumes that all predictors (or features) are independent, which rarely happens in real life. This limits the applicability of the algorithm in real-world use cases.
The algorithm faces the 'zero-frequency problem': it assigns zero probability to a categorical variable whose category in the test data set wasn't present in the training data set. A smoothing technique is the usual way to overcome this issue, as sketched below.
Its probability estimates can be wrong in some cases, so its probability outputs shouldn't be taken too seriously.
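For the zero-frequency problem, Laplace (add-one) smoothing is the standard remedy; a minimal sketch (not from the slides):

    def smoothed_prob(count_value_in_class, count_class, n_distinct_values):
        """P(value | class) with Laplace (add-one) smoothing."""
        return (count_value_in_class + 1) / (count_class + n_distinct_values)

    # Overcast never occurs with Play = No in Example 2 (0 of 5 tuples, 3 Outlook values):
    print(smoothed_prob(0, 5, 3))  # 0.125 instead of 0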
26. Types of Distances Used in Machine Learning Algorithms
Distance metrics are used to represent distances between any two data points.
There are many distance metrics:
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
27. Types of Distances Used in Machine Learning Algorithms
Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
Let's calculate the distance between {2, 3} and {3, 5}:
= √((3-2)² + (5-3)²)
= √(1² + 2²)
= √(1+4)
= √5 ≈ 2.24
Exercise: calculate the distance between {40, 20} and {20, 35}.
28. Types of Distances Used in Machine Learning Algorithms
Manhattan distance: the sum of absolute differences, |x1 - x2| + |y1 - y2|
Let's calculate the distance between {2, 3} and {3, 5}:
= |2-3| + |3-5|
= |-1| + |-2|
= 1 + 2
= 3
Exercise: calculate the distance between {40, 20} and {20, 35}.
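Both metrics are one-liners in code; a minimal sketch, including the {40, 20} to {20, 35} exercise (our computation):

    import math

    def euclidean(p, q):
        """sqrt((X2-X1)^2 + (Y2-Y1)^2)"""
        return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

    def manhattan(p, q):
        """|x1-x2| + |y1-y2|"""
        return abs(p[0] - q[0]) + abs(p[1] - q[1])

    print(euclidean((2, 3), (3, 5)))      # 2.236... = sqrt(5)
    print(manhattan((2, 3), (3, 5)))      # 3
    print(euclidean((40, 20), (20, 35)))  # 25.0 = sqrt(400 + 225)
    print(manhattan((40, 20), (20, 35)))  # 35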
29. The k-Nearest Neighbor Algorithm
The k-nearest neighbors (KNN) algorithm is a non-parametric supervised learning classifier.
It uses proximity to make classifications or predictions about the grouping of an individual data point.
It is among the most popular and simplest classification and regression classifiers used in machine learning today.
It is mostly suited for binary classification.
30. The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space.
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
The target function could be discrete- or real-valued.
For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: a query point xq among + and - labeled training points, illustrating the 1-NN decision surface.]
31. The k-Nearest Neighbor Algorithm: Steps
Step #1 - Assign a value to K.
Step #2 - Calculate the distance between the new data entry and all other existing data entries, and arrange them in ascending order.
Step #3 - Find the K nearest neighbors to the new entry based on the calculated distances.
Step #4 - Assign the new data entry to the majority class among those K nearest neighbors.
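The four steps map directly onto a few lines of Python; a minimal sketch (the dataset and function name are illustrative):

    import math
    from collections import Counter

    def knn_classify(train, query, k):
        """train: list of ((x, y), label) pairs; returns the majority label of the k nearest."""
        # Steps 1-2: with K given, compute all distances and arrange in ascending order.
        by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
        # Step 3: keep the K nearest neighbors.
        nearest = by_distance[:k]
        # Step 4: assign the majority class among them.
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
    print(knn_classify(train, (2, 2), k=3))  # "A"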
32. Types of Distances Used in Machine Learning Algorithms
Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
Let's calculate the distance between {2, 3} and {3, 5}:
= √((3-2)² + (5-3)²) = √(1+4) = √5 ≈ 2.24
Exercise: calculate the distance between {40, 20} and {20, 35}.
33. Types of Distances Used in Machine Learning Algorithms
Manhattan distance: the sum of absolute differences, |x1 - x2| + |y1 - y2|
Let's calculate the distance between {2, 3} and {3, 5}:
= |2-3| + |3-5| = 1 + 2 = 3
Exercise: calculate the distance between {40, 20} and {20, 35}.
34. The k-Nearest Neighbor Algorithm
For the Barbie movie with IMDb Rating = 7.4 and Duration = 114, what is the Genre?
Assume K = 3; use Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
IMDb Rating | Duration | Genre
8.0 (Mission Impossible) | 160 | Action
6.2 (Gadar 2) | 170 | Action
7.2 (Rocky and Rani) | 168 | Comedy
8.2 (OMG 2) | 155 | Comedy
35. The k-Nearest Neighbor Algorithm
Step 1: Calculate the distances.
Calculate the distance between the new movie (X₁, Y₁) = (7.4, 114) and each movie (X₂, Y₂) in the dataset:
Distance to (8.0, 160) = √((7.4-8.0)² + (114-160)²) = √(0.36+2116) = 46.00
Distance to (6.2, 170) = √((7.4-6.2)² + (114-170)²) = √(1.44+3136) = 56.01
Distance to (7.2, 168) = √((7.4-7.2)² + (114-168)²) = √(0.04+2916) = 54.00
Distance to (8.2, 155) = √((7.4-8.2)² + (114-155)²) = √(0.64+1681) = 41.01
Step 2: Select the K nearest neighbors.
For K = 1, the shortest distance is 41.01, so the Barbie movie's genre is Comedy.
36. The k-Nearest Neighbor Algorithm
Step 3: Majority voting (classification).
For K = 3, the shortest distances are 41.01, 46.00 and 54.00, with genres Comedy, Action and Comedy.
So the Barbie movie's genre is Comedy.
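Running the knn_classify sketch from above on this dataset reproduces the result:

    movies = [((8.0, 160), "Action"), ((6.2, 170), "Action"),
              ((7.2, 168), "Comedy"), ((8.2, 155), "Comedy")]
    print(knn_classify(movies, (7.4, 114), k=3))  # "Comedy"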
37. The k-Nearest Neighbor Algorithm
Exercise: for the test tuple Brightness = 20, Saturation = 35, what is the Class?
Assume K = 5; use Euclidean distance: √((X₂-X₁)² + (Y₂-Y₁)²)
BRIGHTNESS | SATURATION | CLASS
40 | 20 | Red
50 | 50 | Blue
60 | 90 | Blue
10 | 25 | Red
70 | 70 | Blue
60 | 10 | Red
25 | 80 | Blue
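For checking the exercise, the same knn_classify sketch applies (the answer is our own computation, not given on the slide):

    points = [((40, 20), "Red"), ((50, 50), "Blue"), ((60, 90), "Blue"), ((10, 25), "Red"),
              ((70, 70), "Blue"), ((60, 10), "Red"), ((25, 80), "Blue")]
    print(knn_classify(points, (20, 35), k=5))  # "Red" (3 Red vs. 2 Blue among the 5 nearest)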
38. DEPARTMENT OF COMPUTER ENGINEERING, Sanjivani COE, Kopargaon
Reference
Jiawei Han, Micheline Kamber, and Jian Pei, "Data Mining: Concepts and Techniques", Elsevier Publishers, ISBN: 9780123814791, 9780123814807.
https://onlinecourses.nptel.ac.in/noc24_cs22
https://medium.com/analytics-vidhya/type-of-distances-used-in-machine-learning-algorithm-c873467140de
https://www.freecodecamp.org/news/k-nearest-neighbors-algorithm-classifiers-and-model-example/