Naive Bayes#

Note

Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.

Discriminative and Generative Algorithms#

Discriminative: try to learn \(p(y|x)\) directly, such as Logistic Regression.

Generative: try to learn \(p(x|y)\) and \(p(y)\), such as Naive Bayes.

When predicting, use Bayes’ rule:

\[p(y|x)=\frac{p(x|y)p(y)}{p(x)}\]

then:

\[\hat{y}=\underset{y}{\mbox{argmax}}\ \frac{p(x|y)p(y)}{p(x)}= \underset{y}{\mbox{argmax}}\ {p(x|y)p(y)}\]
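In code, prediction is an argmax over the classes; working in log space is equivalent (log is monotonic) and numerically safer. A minimal sketch with hypothetical, made-up values for the log prior and log likelihoods:

import numpy as np

# Hypothetical numbers for a 3-class problem:
# log_prior[c]      = log p(y = c)
# log_likelihood[c] = log p(x | y = c) for the sample being classified
log_prior = np.log(np.array([0.5, 0.3, 0.2]))
log_likelihood = np.array([-4.1, -2.7, -5.3])

# argmax_y p(x | y) p(y), computed as a sum of logs
y_hat = np.argmax(log_likelihood + log_prior)
print(y_hat)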

Algorithm#

Suppose \(x=(x_{1}, x_{2},..., x_{d})\) where each \(x_{j}\) is binary. By the chain rule:

\[ p(x|y) = p(x_{1},x_{2},...,x_{d}|y) = p(x_{1}|y)p(x_{2}|y,x_{1})...p(x_{d}|y,x_{1},...,x_{d-1}) \]

The naive Bayes assumption is that the features are conditionally independent given \(y\):

\[ p(x_{1}|y)p(x_{2}|y,x_{1})...p(x_{d}|y,x_{1},...,x_{d-1}) = p(x_{1}|y)p(x_{2}|y)...p(x_{d}|y) \]

Parameterized by:

\[y\sim Bernoulli(\phi)\]
\[x_{j}|y=1 \sim Bernoulli(\phi_{j|y=1})\]
\[x_{j}|y=0 \sim Bernoulli(\phi_{j|y=0})\]
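To make the parameterization concrete, here is a small sketch (with made-up parameter values) that samples from the generative model: draw \(y\) from its Bernoulli prior, then draw each \(x_{j}\) independently from the Bernoulli conditional that matches \(y\):

import numpy as np

rng = np.random.RandomState(0)
phi = 0.6                                    # p(y = 1), made up
phi_j_y1 = np.array([0.9, 0.2, 0.7, 0.5])    # p(x_j = 1 | y = 1), made up
phi_j_y0 = np.array([0.1, 0.8, 0.3, 0.5])    # p(x_j = 1 | y = 0), made up

y = rng.binomial(1, phi)                     # y ~ Bernoulli(phi)
p = phi_j_y1 if y == 1 else phi_j_y0
x = rng.binomial(1, p)                       # each x_j ~ Bernoulli(phi_{j|y}), independently
print(y, x)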

Define the following counts over the \(n\) training examples:

\[k = \sum_{i=1}^{n}1\left\{y^{(i)}=1\right\}\]
\[s_{j} = \sum_{i=1}^{n}1\left\{x_{j}^{(i)}=1 \wedge y^{(i)}=1\right\}\]
\[t_{j} = \sum_{i=1}^{n}1\left\{x_{j}^{(i)}=1 \wedge y^{(i)}=0\right\}\]

We can write down the joint log likelihood of the data:

\[\begin{split} \begin{equation} \begin{split} \mathcal{L}(\phi, \phi_{j|y=1}, \phi_{j|y=0}) &= \log\prod_{i=1}^{n}p(x^{(i)},y^{(i)})\\ &=\sum_{i=1}^{n}\log p(x^{(i)}, y^{(i)}) \\ &=\sum_{i=1}^{n}\log\Big(p(y^{(i)})\prod_{j=1}^{d}p(x_{j}^{(i)}|y^{(i)})\Big) \\ &=\sum_{i:y^{(i)}=1}\Big(\log\phi + \sum_{j=1}^{d}\log p(x_{j}^{(i)}|y=1)\Big) + \sum_{i:y^{(i)}=0}\Big(\log(1 - \phi) + \sum_{j=1}^{d}\log p(x_{j}^{(i)}|y=0)\Big) \\ &=k\log\phi + (n-k)\log(1 - \phi) + \sum_{j=1}^{d}\Big(s_{j}\log\phi_{j|y=1} + (k-s_{j})\log(1 - \phi_{j|y=1}) + t_{j}\log\phi_{j|y=0} + (n -k - t_{j})\log(1 - \phi_{j|y=0})\Big) \end{split} \end{equation} \end{split}\]
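The maximizers follow by setting each partial derivative to zero; for \(\phi\), for example:

\[\frac{\partial \mathcal{L}}{\partial \phi} = \frac{k}{\phi} - \frac{n-k}{1-\phi} = 0 \quad\Rightarrow\quad \phi = \frac{k}{n}\]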

Doing the same for every parameter gives the maximum likelihood estimates:

\[\phi=\frac{k}{n}\]
\[\phi_{j|y=1} = \frac{s_{j}}{k}\]
\[\phi_{j|y=0} = \frac{t_{j}}{n-k}\]
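These estimates are pure counting, so they are easy to compute directly. A minimal NumPy sketch on synthetic binary data (X_bin and y_bin are made-up placeholders, unrelated to the scikit-learn examples below):

import numpy as np

rng = np.random.RandomState(0)
X_bin = rng.binomial(1, 0.5, size=(20, 5))   # toy binary feature matrix
y_bin = rng.binomial(1, 0.5, size=20)        # toy binary labels

n, d = X_bin.shape
k = np.sum(y_bin == 1)                       # number of samples with y = 1
s = X_bin[y_bin == 1].sum(axis=0)            # s_j: x_j = 1 and y = 1
t = X_bin[y_bin == 0].sum(axis=0)            # t_j: x_j = 1 and y = 0

phi = k / n                                  # MLE of p(y = 1)
phi_j_y1 = s / k                             # MLE of p(x_j = 1 | y = 1)
phi_j_y0 = t / (n - k)                       # MLE of p(x_j = 1 | y = 0)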

Laplace Smoothing#

Without smoothing, a feature value never seen with a given class in training gets probability zero and wipes out the whole product at prediction time. Laplace smoothing fixes this by adding 1 to the numerator and 2 to the denominator:

\[\phi_{j|y=1} = \frac{s_{j} + 1}{k + 2}\]
\[\phi_{j|y=0} = \frac{t_{j} + 1}{n-k + 2}\]

This is Bernoulli Naive Bayes, implemented in scikit-learn as BernoulliNB.

It assumes each feature is binary-valued; if handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# 6 samples with 100 integer-valued features in {0, ..., 4}; BernoulliNB
# binarizes them with the default threshold binarize=0.0 (values > 0 become 1).
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 4, 5])

clf = BernoulliNB()
clf.fit(X, y)

clf.predict(X[2:3])
array([3])
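As a quick sanity check (assuming I'm reading BernoulliNB's fitted attributes correctly: feature_count_ holds per-class feature counts and class_count_ per-class sample counts), the Laplace-smoothed estimates from the previous section, generalized to one set of counts per class and with the default alpha=1.0, should match exp(feature_log_prob_):

import numpy as np

# Recompute (count + 1) / (class size + 2) from the fitted model's counts
manual = (clf.feature_count_ + 1.0) / (clf.class_count_[:, np.newaxis] + 2.0)
print(np.allclose(np.exp(clf.feature_log_prob_), manual))   # expected: True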

Multinomial Naive Bayes#

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification.

This distribution is parametrized by vectors \(\theta_{y} = (\theta_{y1},...,\theta_{yn})\) for each class \(y\), where \(n\) is the number of features (in text classification, the size of the vocabulary; note that \(n\) here counts features, not samples as in the previous section) and \(\theta_{yi} = p(x_{i}|y)\).

The parameters \(\theta_{y}\) are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

\[\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_{y} + \alpha{n}}\]

where \(N_{yi}\) is the number of times feature \(i\) appears in samples of class \(y\), \(N_{y} = \sum_{i=1}^{n}N_{yi}\) is the total count of all features for class \(y\), and \(\alpha \ge 0\) is the smoothing hyperparameter (\(\alpha = 1\) is Laplace smoothing).

from sklearn.naive_bayes import MultinomialNB

# Reuse the integer matrix X and labels y from above;
# MultinomialNB treats each feature value as a count.
clf = MultinomialNB()
clf.fit(X, y)

clf.predict(X[2:3])
array([3])
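The same kind of check works here, under the same assumption about the fitted attributes: with the default \(\alpha = 1\), exponentiating feature_log_prob_ should reproduce the smoothed relative frequencies \(\hat{\theta}_{yi}\):

import numpy as np

N_yi = clf.feature_count_                      # N_yi: total count of feature i in class y
N_y = N_yi.sum(axis=1, keepdims=True)          # N_y = sum_i N_yi
manual = (N_yi + 1.0) / (N_y + N_yi.shape[1])  # alpha = 1, n = number of features
print(np.allclose(np.exp(clf.feature_log_prob_), manual))   # expected: True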

Gaussian Naive Bayes#

When \(x_{j}\) is real-valued, we can assume the likelihood of the features is Gaussian:

\[p(x_{j}|y) = \frac{1}{\sqrt{2\pi\sigma_{y}^2}}\exp\left(-\frac{(x_{j} - \mu_{y})^2}{2\sigma_{y}^2}\right)\]

As in the binary case, the parameters \(\mu_{y}\) and \(\sigma_{y}^{2}\) are estimated by maximum likelihood.
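For each class \(y\) and feature \(j\) (the feature subscript is omitted on \(\mu_{y}\) and \(\sigma_{y}^{2}\), as above), this reduces to the per-class sample mean and (biased) sample variance, where \(n_{y}\) is the number of training samples with label \(y\):

\[\mu_{y} = \frac{1}{n_{y}}\sum_{i:\,y^{(i)}=y} x_{j}^{(i)}\]
\[\sigma_{y}^{2} = \frac{1}{n_{y}}\sum_{i:\,y^{(i)}=y}\left(x_{j}^{(i)} - \mu_{y}\right)^{2}\]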

This is Gaussian Naive Bayes, implemented in GaussianNB.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d" 
      % (X_test.shape[0], (y_test != y_pred).sum()))
Number of mislabeled points out of a total 75 points : 4
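To connect the fitted model back to the formulas above: GaussianNB stores the per-class feature means in theta_ (and, in recent scikit-learn versions, the per-class variances in var_). A small check that theta_ is just the class-wise mean of the training features:

import numpy as np

# Class-wise feature means computed directly from the training split
manual_means = np.array([X_train[y_train == c].mean(axis=0) for c in gnb.classes_])
print(np.allclose(gnb.theta_, manual_means))   # expected: True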