Naive Bayes#

Note

Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.

Discriminative and Generative Algorithms#

Discriminative: try to learn $p(y|x)$ directly, such as Logistic Regression.

Generative: try to learn $p(x|y)$ and $p(y)$, such as Naive Bayes.

When predicting, use Bayes’ rule:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$$

then:

$$\hat{y} = \arg\max_y \frac{p(x|y)\,p(y)}{p(x)} = \arg\max_y\, p(x|y)\,p(y)$$
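As a quick illustration of this argmax rule, here is a minimal sketch; the class priors and likelihood values below are made-up numbers, purely for demonstration:

import numpy as np

# hypothetical class priors p(y) and likelihood values p(x|y) at one input x
prior = np.array([0.6, 0.3, 0.1])
likelihood = np.array([0.02, 0.10, 0.05])

# p(x) is the same for every class, so it can be dropped from the argmax
y_hat = np.argmax(likelihood * prior)
print(y_hat)  # 1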

Algorithm#

Suppose $x = (x_1, x_2, \ldots, x_d)$ and each $x_j$ is binary. By the chain rule of probability:

$$p(x|y) = p(x_1, x_2, \ldots, x_d \mid y) = p(x_1 \mid y)\, p(x_2 \mid y, x_1) \cdots p(x_d \mid y, x_1, \ldots, x_{d-1})$$

The naive Bayes independence assumption is that the features are conditionally independent given $y$:

$$p(x_1 \mid y)\, p(x_2 \mid y, x_1) \cdots p(x_d \mid y, x_1, \ldots, x_{d-1}) = p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_d \mid y)$$
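A minimal sketch of what this factorization looks like for binary features, using made-up per-feature Bernoulli parameters:

import numpy as np

# hypothetical p(x_j = 1 | y) for d = 4 binary features of one class
phi_j = np.array([0.8, 0.1, 0.6, 0.3])
x = np.array([1, 0, 1, 1])  # one binary feature vector

# under the naive assumption, p(x|y) is just a product of per-feature terms
p_x_given_y = np.prod(np.where(x == 1, phi_j, 1 - phi_j))
print(p_x_given_y)  # 0.8 * 0.9 * 0.6 * 0.3 = 0.1296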

The model is parameterized by:

$$y \sim \mathrm{Bernoulli}(\phi)$$
$$x_j \mid y=1 \sim \mathrm{Bernoulli}(\phi_{j|y=1})$$
$$x_j \mid y=0 \sim \mathrm{Bernoulli}(\phi_{j|y=0})$$

For notational convenience, define the counts:

$$k = \sum_{i=1}^{n} 1\{y^{(i)}=1\}$$
$$s_j = \sum_{i=1}^{n} 1\{x_j^{(i)}=1 \wedge y^{(i)}=1\}$$
$$t_j = \sum_{i=1}^{n} 1\{x_j^{(i)}=1 \wedge y^{(i)}=0\}$$
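A small sketch of these counts on made-up binary data (X_bin and y_bin are invented here purely for illustration):

import numpy as np

# made-up binary design matrix (n = 5 samples, d = 3 features) and labels
X_bin = np.array([[1, 0, 1],
                  [0, 1, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])
y_bin = np.array([1, 1, 0, 0, 1])

k = np.sum(y_bin == 1)             # number of samples with y = 1
s = X_bin[y_bin == 1].sum(axis=0)  # s_j: how often x_j = 1 among y = 1
t = X_bin[y_bin == 0].sum(axis=0)  # t_j: how often x_j = 1 among y = 0
print(k, s, t)  # 3 [2 1 2] [1 1 1]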

We can write down the joint log likelihood of the data:

$$
\begin{aligned}
\mathcal{L}(\phi, \phi_{j|y=1}, \phi_{j|y=0})
&= \log \prod_{i=1}^{n} p(x^{(i)}, y^{(i)}) \\
&= \sum_{i=1}^{n} \log\big( p(x^{(i)}, y^{(i)}) \big) \\
&= \sum_{i=1}^{n} \log\Big( p(y^{(i)}) \prod_{j=1}^{d} p(x_j^{(i)} \mid y^{(i)}) \Big) \\
&= \sum_{i:\,y^{(i)}=1} \Big( \log(\phi) + \sum_{j=1}^{d} \log\big( p(x_j^{(i)} \mid y=1) \big) \Big)
 + \sum_{i:\,y^{(i)}=0} \Big( \log(1-\phi) + \sum_{j=1}^{d} \log\big( p(x_j^{(i)} \mid y=0) \big) \Big) \\
&= k \log(\phi) + (n-k)\log(1-\phi)
 + \sum_{j=1}^{d} \Big( s_j \log(\phi_{j|y=1}) + (k-s_j)\log(1-\phi_{j|y=1})
 + t_j \log(\phi_{j|y=0}) + (n-k-t_j)\log(1-\phi_{j|y=0}) \Big)
\end{aligned}
$$

Maximizing with respect to each parameter gives the maximum likelihood estimates:

$$\phi = \frac{k}{n}$$
$$\phi_{j|y=1} = \frac{s_j}{k}$$
$$\phi_{j|y=0} = \frac{t_j}{n-k}$$
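Continuing the toy sketch above, the estimates are just these ratios (a sketch of the closed-form result, not a library implementation):

n = len(y_bin)
phi = k / n              # estimate of p(y = 1): 0.6
phi_j_y1 = s / k         # estimates of p(x_j = 1 | y = 1): about [0.67, 0.33, 0.67]
phi_j_y0 = t / (n - k)   # estimates of p(x_j = 1 | y = 0): [0.5, 0.5, 0.5]
print(phi, phi_j_y1, phi_j_y0)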

Laplace Smoothing#

A feature value never seen in the training data would get probability zero and zero out the whole product when predicting. To avoid this, we add 1 to the numerator and 2 to the denominator:

$$\phi_{j|y=1} = \frac{s_j+1}{k+2}$$
$$\phi_{j|y=0} = \frac{t_j+1}{n-k+2}$$
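Continuing the same sketch, the smoothed estimates become the following; this add-one smoothing matches what BernoulliNB applies with its default alpha=1.0:

phi_j_y1_smooth = (s + 1) / (k + 2)      # [0.6, 0.4, 0.6]
phi_j_y0_smooth = (t + 1) / (n - k + 2)  # [0.5, 0.5, 0.5]
print(phi_j_y1_smooth, phi_j_y0_smooth)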

This model is Bernoulli Naive Bayes for binary features, and it is implemented in BernoulliNB.

It assumes each feature is binary-valued; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

import numpy as np

# random integer count data: 6 samples, 100 features with values in {0, ..., 4}
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 4, 5])

from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB()
clf.fit(X, y)

clf.predict(X[2:3])
array([3])

Multinomial Naive Bayes#

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification.

This distribution is parametrized by vectors $\theta_y = (\theta_{y1}, \ldots, \theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi} = p(x_i \mid y)$.

The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

where $N_{yi}$ is the number of times feature $i$ appears in samples of class $y$, $N_y = \sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$, and $\alpha \ge 0$ is the smoothing hyperparameter.
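A sketch of this estimate on made-up word-count data; the cross-check at the end assumes the fitted model exposes its learned log parameters as feature_log_prob_:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# made-up word-count matrix (4 documents, 3 vocabulary terms) and labels
docs = np.array([[2, 1, 0],
                 [3, 0, 1],
                 [0, 2, 4],
                 [1, 1, 3]])
labels = np.array([0, 0, 1, 1])
alpha = 1.0

# smoothed relative frequencies for class 0
counts_0 = docs[labels == 0]
theta_0 = (counts_0.sum(axis=0) + alpha) / (counts_0.sum() + alpha * docs.shape[1])
print(theta_0)  # [0.6 0.2 0.2]

# should agree with the fitted model's (log) parameters for class 0
m = MultinomialNB(alpha=alpha).fit(docs, labels)
print(np.exp(m.feature_log_prob_[0]))  # [0.6 0.2 0.2]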

from sklearn.naive_bayes import MultinomialNB

# reuse the integer count data X, y from the BernoulliNB example above
clf = MultinomialNB()
clf.fit(X, y)

clf.predict(X[2:3])
array([3])

Gaussian Naive Bayes#

When $x_j$ is real-valued, we can assume the likelihood of the features is Gaussian:

$$p(x_j \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\!\left(-\frac{(x_j - \mu_y)^2}{2\sigma_y^2}\right)$$

Just as in the binary case, the parameters $\mu_y$ and $\sigma_y^2$ are estimated using maximum likelihood.
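A minimal sketch of this per-feature Gaussian likelihood, with made-up measurements for one feature within a single class:

import numpy as np

# made-up real-valued observations of one feature within one class
x_j = np.array([4.9, 5.1, 5.0, 5.3, 4.7])

mu = x_j.mean()   # MLE of the class-conditional mean
var = x_j.var()   # MLE of the variance (biased, divide-by-n estimator)

def gaussian_pdf(x, mu, var):
    # p(x_j | y) for a single feature value under the fitted Gaussian
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(mu, var, gaussian_pdf(5.0, mu, var))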

This is Gaussian Naive Bayes, implemented in GaussianNB.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d" 
      % (X_test.shape[0], (y_test != y_pred).sum()))
Number of mislabeled points out of a total 75 points : 4