Bayesian Learning
Naïve Bayes Classification
• Naive Bayes is one of the simplest and most powerful classification algorithms. It is based on Bayes' Theorem with an assumption of independence among predictors.
• The Naive Bayes model is easy to build and particularly useful for very large data sets. There are two parts to the name:
1. Naive
2. Bayes
• The Naive Bayes classifier assumes that the presence of a feature in a class is unrelated to any other feature. The algorithm is 'naive' because this assumption may or may not hold in practice.
Contd..
• The Naive Bayes model is easy to build and particularly useful for very large data sets.
• Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods when its independence assumption roughly holds.
Prerequisites for Bayes Theorem
Experiment
• An experiment is a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card, or rolling a die.
Sample Space
• The results obtained during an experiment are called outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space is:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is tossing a coin and recording its outcome, then the sample space is:
S2 = {Head, Tail}
Prerequisites for Bayes Theorem
Event
• An event is defined as a subset of the sample space of an experiment; in other words, a set of outcomes.
• Assume that in our experiment of rolling a die there are two events A and B such that:
• A = Event that an even number is obtained = {2, 4, 6}
• B = Event that a number greater than 4 is obtained = {5, 6}
• Probability of the event A, P(A) = Number of favourable outcomes / Total number of possible outcomes
P(A) = 3/6 = 1/2 = 0.5
• Similarly, probability of the event B, P(B) = Number of favourable outcomes / Total number of possible outcomes
P(B) = 2/6 = 1/3 ≈ 0.333
• Union of events A and B:
A ∪ B = {2, 4, 5, 6}
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
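As a quick worked check of this formula with the dice events above: A ∩ B = {6}, so P(A ∩ B) = 1/6 and
P(A ∪ B) = 3/6 + 2/6 − 1/6 = 4/6 ≈ 0.667,
which matches counting the four outcomes of A ∪ B = {2, 4, 5, 6} out of six.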
Prerequisites for Bayes Theorem
Event
• Intersection of events A and B: A ∩ B = {6}
• When A and B are independent events, the probability of their intersection is P(A ∩ B) = P(A) × P(B).
• Disjoint Event: If the intersection of events A and B is an empty set (null), then the events are known as disjoint or mutually exclusive events.
Prerequisites for Bayes Theorem
Exhaustive Event:
• As the name suggests, a set of events in which at least one event must occur at a time is called an exhaustive set of events for an experiment.
• Thus, two events A and B are said to be exhaustive if either A or B must definitely occur; in the coin-toss example they are also mutually exclusive, since the outcome is either a Head or a Tail.
Independent Event:
• Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other. In simple words, the probability of the outcome of one event does not depend on the other.
• Mathematically, two events A and B are said to be independent if:
• P(A ∩ B) = P(AB) = P(A)*P(B)
Conditional Probability:
• Conditional probability is defined as the probability of an event A, given that another event B has already occurred (i.e. A conditional on B). This is represented by P(A|B) and we can define it as:
• P(A|B) = P(A ∩ B) / P(B)
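Using the dice events from earlier as a quick worked example: A ∩ B = {6}, so
P(A|B) = P(A ∩ B) / P(B) = (1/6) / (2/6) = 1/2,
i.e. given that the roll is greater than 4, the chance that it is also even is 1/2.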
Prerequisites for Bayes Theorem
Law of Total Probability
• For two events A and B associated with a sample space S, the sample space can be divided into the sets A ∩ B, A ∩ B′, A′ ∩ B and A′ ∩ B′. This collection of sets is said to be mutually disjoint or pairwise disjoint because any pair of sets in it is disjoint. The elements of such a collection are better known as a partition of the sample space.
Prerequisites for Bayes Theorem
• Law of Total Probability Statement
• Let events C1, C2, . . ., Cn form a partition of the sample space S, where all the events have a non-zero probability of occurrence. For any event A associated with S, according to the total probability theorem,
• P(A) = P(C1)P(A|C1) + P(C2)P(A|C2) + P(C3)P(A|C3) + . . . + P(Cn)P(A|Cn)
• Hence, the theorem can be stated in the form of an equation as P(A) = Σi P(Ci)P(A|Ci).
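As a small illustration of the theorem, here is a minimal Python sketch; the partition probabilities and conditionals used below are made-up illustrative numbers, not values from the slides.

```python
# Minimal sketch of the law of total probability.
priors = {"C1": 0.5, "C2": 0.3, "C3": 0.2}          # P(Ci), a partition: sums to 1
likelihoods = {"C1": 0.9, "C2": 0.4, "C3": 0.1}     # P(A | Ci)

# P(A) = sum_i P(Ci) * P(A | Ci)
p_a = sum(priors[c] * likelihoods[c] for c in priors)
print(p_a)  # 0.5*0.9 + 0.3*0.4 + 0.2*0.1 = 0.59
```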
• If we add a naive assumption to Bayes' theorem, namely independence among the features, we can split the evidence into independent parts.
• Now, if any two events A and B are independent, then
P(A,B) = P(A)P(B)
Hence, for a class y and features x1, ..., xn, we reach the result:
P(y | x1, ..., xn) = P(x1 | y) P(x2 | y) ... P(xn | y) P(y) / (P(x1) P(x2) ... P(xn))
which can be expressed as the classification rule:
y = argmax over y of P(y) Πi P(xi | y)
Prediction using Bayes Theorem - Example
The example uses the well-known play-golf weather dataset of 14 records. The dataset is divided into two parts, namely, the feature matrix and the response vector.
• The feature matrix contains all the vectors (rows) of the dataset, where each vector consists of the values of the predictor features. In this dataset, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.
• The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix. In this dataset, the class variable name is 'Play golf'.
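To make the feature-matrix/response-vector split concrete, here is a minimal pandas sketch; the three rows shown are hypothetical placeholders, and only the column names come from the slide.

```python
import pandas as pd

# Hypothetical rows in the shape of the play-golf dataset described above;
# the actual 14 records are not reproduced here.
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Overcast", "Rainy"],
    "Temperature": ["Hot",   "Mild",     "Cool"],
    "Humidity":    ["High",  "Normal",   "Normal"],
    "Windy":       [False,   False,      True],
    "Play golf":   ["No",    "Yes",      "Yes"],
})

X = data.drop(columns=["Play golf"])  # feature matrix
y = data["Play golf"]                 # response vector
```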
Example contd..
We need to find P(xi | yj) for each xi in X and yj in y. These calculations are demonstrated in the tables below (frequency/likelihood tables for each feature):
We have calculated P(xi | yj) for each xi in X and yj in y manually in tables 1-4. For example, the probability of playing golf given that the temperature is cool is
P(temp = cool | play golf = Yes) = 3/9.
Also, we need the class probabilities P(y), which are calculated in table 5. For example, P(play golf = Yes) = 9/14.
So now we are done with our pre-computations and the classifier is ready!
• Let us test the classifier on a new set of features (let us call it today):
• today = (Sunny, Hot, Normal, False)
• So, the probability of playing golf is given by:
P(Yes | today) = P(Sunny | Yes) P(Hot | Yes) P(Normal | Yes) P(False | Yes) P(Yes) / P(today)
• and the probability of not playing golf is given by:
P(No | today) = P(Sunny | No) P(Hot | No) P(Normal | No) P(False | No) P(No) / P(today)
• Since P(today) is common to both expressions, we can ignore P(today) and compare the proportional probabilities:
P(Yes | today) ∝ P(Sunny | Yes) P(Hot | Yes) P(Normal | Yes) P(False | Yes) P(Yes)
and
P(No | today) ∝ P(Sunny | No) P(Hot | No) P(Normal | No) P(False | No) P(No)
Test – Rainy, Mild, High, False (another instance to try in the same way)
• Now, since the unnormalized value for 'Yes' comes out larger than the one for 'No' on the today instance,
• these numbers can be converted into probabilities by making their sum equal to 1 (normalization):
P(Yes | today) = P(Yes | today) / (P(Yes | today) + P(No | today))
and similarly for P(No | today).
Since P(Yes | today) > P(No | today), the prediction is that golf would be played: 'Yes'.
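To make the pre-computation and prediction steps concrete, here is a minimal Python sketch of a categorical Naive Bayes classifier in the spirit of the slides. The `rows` list is a placeholder standing in for the full 14-record dataset, so the printed prediction is only illustrative.

```python
from collections import Counter, defaultdict

# Placeholder rows; the full 14-record play-golf dataset is not reproduced here.
rows = [
    (("Sunny", "Hot", "High", False), "No"),
    (("Overcast", "Hot", "High", False), "Yes"),
    (("Rainy", "Mild", "High", False), "Yes"),
    # ... remaining rows of the dataset ...
]

# Class counts for P(y) and per-feature conditional counts for P(xi | y).
class_counts = Counter(label for _, label in rows)
feature_counts = defaultdict(Counter)   # (feature_index, label) -> value counts
for features, label in rows:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1

def predict(features):
    """Return the class maximizing P(y) * prod_i P(xi | y)."""
    scores = {}
    total = sum(class_counts.values())
    for label, count in class_counts.items():
        score = count / total                           # P(y)
        for i, value in enumerate(features):
            score *= feature_counts[(i, label)][value] / count   # P(xi | y), no smoothing
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(("Sunny", "Hot", "Normal", False)))  # the 'today' instance
```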
Advantages of Naïve Bayes
• It is not only a simple approach but also a fast and accurate method for prediction.
• Naive Bayes has a very low computation cost.
• It can efficiently work on a large dataset.
• It performs well with discrete variables compared to continuous ones.
• It can be used for multi-class prediction problems.
• It also performs well on text analytics problems.
• When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression.
Disadvantages
• The assumption of independent features. In practice, it is almost impossible that the model will get a set of predictors which are entirely independent.
• If a particular attribute value never appears with a given class in the training data, its conditional probability is zero, which makes the whole posterior probability zero. In this case, the model is unable to make a prediction. This is known as the Zero Probability/Frequency Problem.
Applications
• Real-time Prediction: Naive Bayes is an eager learning classifier and it is fast, so it can be used for making predictions in real time.
• Multi-class Prediction: This algorithm is also well known for its multi-class prediction capability; we can predict the probabilities of multiple classes of the target variable.
• Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification. Due to better results on multi-class problems and the independence assumption, they have a higher success rate compared to many other algorithms. As a result, Naive Bayes is widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiment).
• Recommendation System: A Naive Bayes classifier and collaborative filtering together can build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
Zero Probability Problem in Naïve Bayes Classifier
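A common remedy, not spelled out in the surrounding text, is Laplace (add-one) smoothing: add a small count k to every attribute-value/class combination so that no conditional probability is exactly zero. A minimal sketch follows; the counts used are hypothetical illustrations.

```python
# Hedged sketch of Laplace (add-k) smoothing, the standard fix for the
# zero-frequency problem; the counts below are hypothetical.
def smoothed_likelihood(value_count, class_count, n_values, k=1):
    """P(xi = value | y) with add-k smoothing over n_values possible values."""
    return (value_count + k) / (class_count + k * n_values)

# Example: an attribute value that never occurs with a class (0 out of 5),
# where the attribute can take 3 possible values.
print(smoothed_likelihood(0, 5, 3))   # (0+1)/(5+3) = 0.125 instead of 0
print(smoothed_likelihood(4, 5, 3))   # (4+1)/(5+3) = 0.625
```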
Bayesian Belief Network
• "A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."
• It is also called a Bayes network, belief network, decision network, or Bayesian model.
• Bayesian networks are probabilistic because they are built from a probability distribution, and they also use probability theory for prediction and anomaly detection.
• Real-world applications are probabilistic in nature, and to represent the relationships between multiple events, we need a Bayesian network.
• It can also be used in various tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time-series prediction, and decision making under uncertainty.
• A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:
• Directed Acyclic Graph (DAG)
• Conditional Probability Table (CPT)
Bayesian Belief Network
• A Bayesian network graph is made up of nodes and arcs (directed links), where:
• Each node corresponds to a random variable, and a variable can be continuous or discrete.
• Arcs (directed arrows) represent the causal relationships or conditional dependencies between random variables. These directed links connect pairs of nodes in the graph; a link means that one node directly influences the other, and if there is no directed link between two nodes, they do not directly influence each other.
• In the example diagram, A, B, C, and D are random variables represented by the nodes of the network graph.
• If node B is connected with node A by a directed arrow from A to B, then node A is called the parent of node B.
• Node C is conditionally independent of node A (given its parents).
Bayesian Belief Network
• Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
• A Bayesian network is based on the joint probability distribution and conditional probability.
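In particular, the joint distribution represented by the network factorizes over the local conditional distributions:
P(X1, X2, ..., Xn) = Πi P(Xi | Parents(Xi)),
which is exactly the chain-rule factorization used in the chaining example later in these slides.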
Bayesian Belief Networks (BNs)
• Definition: BN = (DAG, CPD)
– DAG: directed acyclic graph (the BN's structure)
• Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables)
• Arcs: indicate probabilistic dependencies between nodes (lack of a link signifies conditional independence)
– CPD: conditional probability distribution (the BN's parameters)
• Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT)
– Root nodes are a special case: they have no parents, so their CPD is just the prior, P(Xi).
Example BN
Nodes: a, b, c, d, e, with arcs a → b, a → c, b → d, c → d, c → e (as used in the chaining example below).
P(A) = 0.001
P(B|A) = 0.3      P(B|¬A) = 0.001
P(C|A) = 0.2      P(C|¬A) = 0.005
P(D|B,C) = 0.1    P(D|B,¬C) = 0.01    P(D|¬B,C) = 0.01    P(D|¬B,¬C) = 0.00001
P(E|C) = 0.4      P(E|¬C) = 0.002
Note that we only specify P(A) etc., not P(¬A), since the two have to add to one.
Conditional independence and chaining
• Conditional independence assumption:
P(xi | πi, q) = P(xi | πi)
where πi denotes the parents of node xi and q is any set of variables (nodes) other than xi and its successors.
– πi blocks the influence of other nodes on xi and its successors (q influences xi only through variables in πi).
– With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) the local CPDs by chaining them:
P(x1, ..., xn) = Πi P(xi | πi)
Chaining: Example
Computing the joint probability for all variables is easy:
P(a, b, c, d, e)
= P(e | a, b, c, d) P(a, b, c, d)      by the product rule
= P(e | c) P(a, b, c, d)               by the conditional independence assumption
= P(e | c) P(d | a, b, c) P(a, b, c)
= P(e | c) P(d | b, c) P(c | a, b) P(a, b)
= P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)
(for the example network above: a → b, a → c, b → d, c → d, c → e)
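As a concrete check, here is a small Python sketch that evaluates this chained product with the CPT values from the Example BN slide; the particular truth assignment chosen at the end is an arbitrary illustration.

```python
# Evaluate the chained factorization
# P(a,b,c,d,e) = P(e|c) P(d|b,c) P(c|a) P(b|a) P(a)
# using the CPT numbers from the Example BN slide.
P_A = 0.001
P_B_given_A = {True: 0.3, False: 0.001}        # P(B|A), P(B|¬A)
P_C_given_A = {True: 0.2, False: 0.005}
P_D_given_BC = {(True, True): 0.1, (True, False): 0.01,
                (False, True): 0.01, (False, False): 0.00001}
P_E_given_C = {True: 0.4, False: 0.002}

def prob(value, p_true):
    """Probability of a boolean value, given the probability that it is True."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    return (prob(a, P_A)
            * prob(b, P_B_given_A[a])
            * prob(c, P_C_given_A[a])
            * prob(d, P_D_given_BC[(b, c)])
            * prob(e, P_E_given_C[c]))

# Arbitrary example assignment: all variables True.
print(joint(True, True, True, True, True))  # 0.001 * 0.3 * 0.2 * 0.1 * 0.4
```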
Three-variable independence structures and their joint factorizations:
• Marginal independence (A, B, C with no arcs): P(A,B,C) = P(A)*P(B)*P(C)
• Conditional independence (common parent: A → B, A → C): P(A,B,C) = P(B|A)*P(C|A)*P(A)
• Causal independence (common effect: A → C ← B): P(A,B,C) = P(C|A,B)*P(A)*P(B)
• Markov independence (chain: A → B → C): P(A,B,C) = P(C|B)*P(B|A)*P(A)
Example
• Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably responds to a burglary, but it also responds to minor earthquakes. Harry has two neighbours, David and Sophia, who have taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls at those times too. On the other hand, Sophia likes to listen to loud music, so sometimes she misses hearing the alarm. Here we would like to compute probabilities over the burglary-alarm events.
• Problem:
• Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia called Harry.
• We can write the events of the problem statement in the form of the probability P[D, S, A, B, E], and rewrite this probability using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E]. P[S, A, B, E]
= P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
= P[D | A]. P[S | A, B, E]. P[A, B, E]
= P[D | A]. P[S | A]. P[A | B, E]. P[B, E]
= P[D | A]. P[S | A]. P[A | B, E]. P[B]. P[E]
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia called Harry:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B, ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045
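A quick Python check of this arithmetic, using the probability values quoted on the slide:

```python
# Verify the alarm-network calculation:
# P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B,¬E) * P(¬B) * P(¬E)
p_s_given_a = 0.75
p_d_given_a = 0.91
p_a_given_not_b_not_e = 0.001
p_not_b = 0.998
p_not_e = 0.999

result = p_s_given_a * p_d_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(result, 8))  # ≈ 0.00068045
```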
Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood
• Goal: To find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D.
• MAP Hypothesis:
hMAP = argmax h∈H P(h|D)
     = argmax h∈H P(D|h)P(h)/P(D)
     = argmax h∈H P(D|h)P(h)
• If every hypothesis in H is equally probable a priori, we only need to consider the likelihood of the data D given h, P(D|h). Then hMAP becomes the Maximum Likelihood hypothesis,
hML = argmax h∈H P(D|h)
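To illustrate the difference between the two rules, here is a minimal Python sketch over a small discrete hypothesis space; the priors and likelihoods are made-up illustrative numbers.

```python
# MAP vs. maximum-likelihood hypothesis selection (hypothetical numbers).
priors = {"h1": 0.7, "h2": 0.25, "h3": 0.05}        # P(h)
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.9}     # P(D | h)

h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])   # argmax P(D|h)P(h)
h_ml = max(likelihoods, key=likelihoods.get)                    # argmax P(D|h)

print(h_map)  # 'h2': 0.4*0.25 = 0.10 beats 0.07 (h1) and 0.045 (h3)
print(h_ml)   # 'h3': highest likelihood, ignoring the prior
```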
THANK YOU
