Logistic Regression#

Note

Logistic Regression = Logistic model + binary cross entropy loss.
For multi-class classification problems, we can use softmax regression.

Model#

For a binary classification problem where \(x \in \mathbb{R}^{d}\) and \(y \in \left\{0, 1\right\}\), we could ignore the fact that \(y\) is discrete and approach the problem with linear regression. However, it is easy to construct examples where this performs poorly, and it does not make sense for \(h(x)\) to take values outside \(\left[0, 1\right]\).

To fix this, we apply the logistic (sigmoid) function after the affine transformation to force the output into \(\left[0, 1\right]\):

\[h(x) = \frac{1}{1 + \exp(-\theta^{T}x)} = \sigma(\theta^{T}x)\]
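As a concrete illustration, here is a minimal NumPy sketch of this hypothesis; the vectors `theta` and `x` below are made-up values, not taken from any dataset:

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # hypothesis: sigmoid applied to the linear score theta^T x
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])  # made-up parameters
x = np.array([1.0, 0.3, 0.8])       # made-up input
h(theta, x)                          # a value in (0, 1), interpreted as p(y=1|x)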

Entropy#

Self-information \(I(x)\) quantifies the amount of information conveyed by the occurrence of an event \(x\); it should satisfy:

  1. \(I(x) \ge 0\)

  2. \(\text{if }p(x_{1}) > p(x_{2}) \text{, then } I(x_{1}) < I(x_{2})\)

  3. \(I(x_{1}, x_{2}) = I(x_{1}) + I(x_{2}) \text{ for independent }x_{1},x_{2}\)

These requirements lead to \(I(x) = -\log_{r}p(x)\); for convenience we take \(I(x) := -\log{p(x)}\).

While self-information measures the information of a single event, entropy measures the expected information of a random variable:

\[ H(X) = E(I(x)) = E(-\log{p(x)}) = -\sum_{x \in \mathcal{X}}p(x)\log{p(x)} \]

It is exactly the optimal average encoding length of \(X\).

Cross entropy \(H(p, q)\) is the expected encoding length when samples from \(p\) are encoded with the optimal code for \(q\):

\[H(p,q)=E_{p}\left[-\log{q(x)}\right] = -\sum_{x}p(x)\log{q(x)}\]

For a fixed \(p\), the closer \(q\) is to \(p\), the smaller \(H(p,q)\) becomes, so we can use \(H(p,q)\) as a measure of how far \(q\) is from \(p\).
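A quick numerical illustration with two made-up distributions: \(H(p,q)\) is never smaller than \(H(p)\), and it shrinks toward \(H(p)\) as \(q\) gets closer to \(p\).

import numpy as np

def entropy(p):
    # H(p) = -sum_x p(x) log p(x)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) log q(x)
    return -np.sum(p * np.log(q))

p = np.array([0.7, 0.2, 0.1])        # "true" distribution (made up)
q_near = np.array([0.6, 0.25, 0.15]) # close to p
q_far = np.array([0.1, 0.2, 0.7])    # far from p
entropy(p), cross_entropy(p, q_near), cross_entropy(p, q_far)  # increasing order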

Loss#

Using the definition of cross entropy above, we interpret the label \(y^{(i)}\) as the distribution \(p(y^{(i)}|x^{(i)}) = 1\), \(p(1 - y^{(i)}|x^{(i)}) = 0\).

In the same manner, interpret the hypothesis as \(q(y=1|x^{(i)}) = h(x^{(i)}), q(y=0|x^{(i)}) = 1 - h(x^{(i)})\).

The cross entropy loss from \(q\) to \(p\) then measures the distance from the hypothesis to the label:

\[l_{\theta}(x^{(i)}) = -y^{(i)}\log(h(x^{(i)})) - (1 - y^{(i)})\log(1 - h(x^{(i)}))\]

Summing over the training examples yields the cross entropy loss that logistic regression uses:

\[J(\theta) = \sum_{i=1}^{n}\left[-y^{(i)}\log(h(x^{(i)})) - (1 - y^{(i)})\log(1 - h(x^{(i)}))\right]\]
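A vectorized sketch of \(J(\theta)\), assuming a design matrix `X` of shape \((n, d)\) and a binary label vector `y`; the names are illustrative, not part of any library API:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(theta, X, y):
    # J(theta) = sum_i [ -y_i log h(x_i) - (1 - y_i) log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))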

Probabilistic Interpretation#

Suppose we model the conditional distribution as:

\[p(y|x) = h(x)^{y}\cdot(1 - h(x))^{1 - y} \]

The log-likelihood of the dataset is:

\[\begin{split} \begin{equation} \begin{split} L(\theta) &= \log\prod_{i=1}^{n} h(x^{(i)})^{y^{(i)}}\cdot(1 - h(x^{(i)}))^{1 - y^{(i)}} \\ &= \sum_{i=1}^{n}y^{(i)}\log(h(x^{(i)})) + (1 - y^{(i)})\log(1 - h(x^{(i)})) \end{split} \end{equation} \end{split}\]

So logistic regression \(\Leftrightarrow\) MLE if we view \(h(x)\) as \(p(y=1|x)\): since \(J(\theta) = -L(\theta)\), minimizing the cross entropy loss maximizes the likelihood.
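A tiny numerical check of this equivalence on random made-up data: the loss equals the negative log-likelihood, so minimizing one maximizes the other.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # made-up design matrix
y = rng.integers(0, 2, size=5)    # made-up binary labels
theta = rng.normal(size=3)

h = 1.0 / (1.0 + np.exp(-X @ theta))
L = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))    # log-likelihood
J = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))   # cross entropy loss
np.isclose(J, -L)  # True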

Update Rule#

Gradient of logistic regression:

\[\begin{split} \begin{equation} \begin{split} \frac{\partial }{\partial \theta_{j}}J(\theta ) &= \sum_{i=1}^{n} \left (-y^{(i)}\frac{1}{\sigma(\theta^{T}x^{(i)})} + (1 - y^{(i)})\frac{1}{1 - \sigma(\theta^{T}x^{(i)})} \right )\frac{\partial }{\partial \theta_{j}}\sigma(\theta^{T}x^{(i)})\\ &=\sum_{i=1}^{n} \left (-y^{(i)}\frac{1}{\sigma(\theta^{T}x^{(i)})} + (1 - y^{(i)})\frac{1}{1 - \sigma(\theta^{T}x^{(i)})} \right )\sigma(\theta^{T}x^{(i)})(1-\sigma(\theta^{T}x^{(i)}))\frac{\partial }{\partial \theta_{j}}\theta^{T}x^{(i)} \\ &=\sum_{i=1}^{n}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)} \end{split} \end{equation} \end{split}\]

Combining all dimensions, the gradient descent update is:

\[\theta \to \theta - \alpha\sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})\cdot{x}^{(i)} \]

Written in matrix form:

\[\theta \to \theta - \alpha{X}^{T}(\sigma({X}{\theta})-{y}) \]

where \({X} \in \mathbb{R}^{n\times{d}}, {y} \in \mathbb{R}^{n}\).
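Putting the pieces together, here is a from-scratch sketch of batch gradient descent using the matrix update above; the learning rate, iteration count, and the \(1/n\) scaling are practical choices of this sketch, not part of the derivation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    # theta <- theta - alpha * X^T (sigmoid(X theta) - y)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y)
        theta -= alpha / n * grad   # divide by n to keep the step size stable
    return theta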

Examples#

from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, max_iter=5000)
clf.fit(X, y)
clf.predict(X[:2, :])
array([0, 0])
# score returns the mean accuracy on the given data and labels.
clf.predict_proba(X[:2, :]), clf.score(X, y)
(array([[1.00000000e+00, 3.16211740e-14],
        [9.99996140e-01, 3.86002382e-06]]),
 0.9578207381370826)

Softmax Regression#

For multi-class classification, we start with a simple image classification problem: each input is a \(2\times{2}\) grayscale image, and representing each pixel with a scalar gives us features \(\left\{x_{1},x_{2},x_{3},x_{4}\right\}\). Assume each image belongs to exactly one of the categories “cat”, “chicken” and “dog”.

A convenient way to represent categorical data is one-hot encoding: for our problem, “cat” is represented by \((1,0,0)\), “chicken” by \((0, 1, 0)\), and “dog” by \((0, 0, 1)\).
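A minimal way to build these one-hot vectors with NumPy; mapping “cat”, “chicken”, “dog” to the integer labels 0, 1, 2 is our own convention here:

import numpy as np

labels = np.array([0, 2, 1])   # cat, dog, chicken
np.eye(3)[labels]              # each row is the one-hot vector of one label
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.]])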

To estimate the conditional probabilities of all classes, we need a model with multiple outputs, one per class:

\[o_{1} = x_{1}w_{11} + x_{2}w_{12} + x_{3}w_{13} + x_{4}w_{14}\]
\[o_{2} = x_{1}w_{21} + x_{2}w_{22} + x_{3}w_{23} + x_{4}w_{24}\]
\[o_{3} = x_{1}w_{31} + x_{2}w_{32} + x_{3}w_{33} + x_{4}w_{34}\]

This can be depicted as a single-layer fully connected network mapping the four inputs to the three outputs.

We would like \(\hat{y}_{j}\) to be interpreted as the probability that a given item belongs to class \(j\). To transform the current outputs \(\left\{o_{1},o_{2},o_{3}\right\}\) into a probability distribution \(\left\{\hat{y}_{1},\hat{y}_{2},\hat{y}_{3}\right\}\), we use the softmax operation:

\[\hat{y}_{j} = \frac{\exp(o_{j})}{\sum_{k}\exp(o_{k})}\]
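A sketch of the softmax operation; subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result:

import numpy as np

def softmax(o):
    # o: vector of raw outputs; returns a probability distribution over classes
    o = o - np.max(o)      # for numerical stability
    e = np.exp(o)
    return e / np.sum(e)

softmax(np.array([2.0, 1.0, 0.1]))  # sums to 1; the largest output gets the largest probability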

As in logistic regression, we use the cross entropy loss:

\[H(y,\hat{y}) = -\sum_{j}y_{j}\log\hat{y}_{j} = -\log\hat{y}_{\text{true category of }y}\]

This completes the construction of softmax regression.
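As a sketch, here is one forward pass and its loss for a single example, with made-up weights `W` and the softmax defined above:

import numpy as np

def softmax(o):
    o = o - np.max(o)
    return np.exp(o) / np.sum(np.exp(o))

x = np.array([0.1, 0.8, 0.3, 0.5])  # one 2x2 image, flattened (made up)
W = np.full((3, 4), 0.1)            # made-up weights, one row per class
y = np.array([0.0, 1.0, 0.0])       # one-hot label: "chicken"

o = W @ x                           # raw outputs o_1, o_2, o_3
y_hat = softmax(o)                  # predicted class probabilities
loss = -np.sum(y * np.log(y_hat))   # cross entropy: -log of the true class probability
y_hat, loss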

"""multi-class classification problem"""
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
"""
Set multi_class='multinomial' in LogisticRegression
"""
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10, max_iter=1000)
softmax_reg.fit(X, y)
softmax_reg.predict(X[:3, :])
array([0, 0, 0])