Linear Regression#

Note

Linear Regression = Linear Model + Mean Squared Loss
Linear regression has nice geometric and probabilistic interpretations.

Model#

Suppose \(x \in \mathbb{R}^{d}\), \(y \in \mathbb{R}\). The linear model is:

\[h(x) = w^{T}x + b\]

For simplicity, let:

\[x := [x,1] \in \mathbb{R}^{d + 1}\]
\[\theta := [w, b] \in \mathbb{R}^{d + 1}\]

Then the linear model can be written as:

\[h(x) = \theta^{T}x\]
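A minimal NumPy sketch of this augmentation (the variable names and random data below are illustrative, not from the text): the original form \(w^{T}x + b\) and the augmented form \(\theta^{T}x\) give the same prediction.

import numpy as np

rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)        # a single input in R^d
w = rng.normal(size=d)        # weights
b = 0.5                       # bias

h1 = w @ x + b                # h(x) = w^T x + b
x_aug = np.append(x, 1.0)     # x := [x, 1]
theta = np.append(w, b)       # theta := [w, b]
h2 = theta @ x_aug            # h(x) = theta^T x

assert np.isclose(h1, h2)     # both forms agree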


Loss#

The loss function is the squared loss (the \(\frac{1}{n}\) factor of the mean squared loss is dropped since it does not affect the minimizer):

\[J(\theta) = \frac{1}{2}\sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})^{2}\]
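The same loss written directly in NumPy, assuming the rows of X hold the augmented inputs \(x^{(i)}\) (a sketch, not part of the original text):

import numpy as np

def loss(theta, X, y):
    # J(theta) = 1/2 * sum_i (theta^T x_i - y_i)^2
    residual = X @ theta - y
    return 0.5 * residual @ residual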

Update Rule#

Gradient Descent:

\[\theta \to \theta - \alpha\nabla{J(\theta)}\]

Gradient of Linear Regression:

\[\begin{split} \begin{equation} \begin{split} \frac{\partial }{\partial \theta_{j}}J(\theta) &= \frac{\partial }{\partial \theta_{j}}\frac{1}{2}\sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})^2 \\ &=\sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})\cdot\frac{\partial }{\partial \theta_{j}}(h(x^{(i)}) - y^{(i)})\\ & =\sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})\cdot x_{j}^{(i)} \end{split} \end{equation} \end{split}\]

Combining all dimensions:

\[\theta \to \theta - \alpha\sum_{i=1}^{n}(h(x^{(i)}) - y^{(i)})\cdot{x^{(i)}} \]

Written in matrix form:

\[\theta \to \theta - \alpha{X^{T}}(X\theta-y) \]

where \(X \in \mathbb{R}^{n\times(d+1)}\) is the design matrix whose \(i\)-th row is \((x^{(i)})^{T}\) and \(y \in \mathbb{R}^{n}\).
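A minimal sketch of batch gradient descent with this matrix-form update (the step size, step count, and synthetic data are assumptions for illustration):

import numpy as np

def gradient_descent(X, y, alpha=1e-3, n_steps=5000):
    # X: (n, d+1) design matrix with augmented inputs as rows; y: (n,) targets
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y)   # gradient of J(theta)
        theta = theta - alpha * grad   # theta <- theta - alpha * grad
    return theta

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=100)
print(gradient_descent(X, y))          # approximately [2.0, -1.0, 0.5]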

Analytic Solution#

From above, we have:

\[\nabla{J(\theta)} = X^{T}X\theta - X^{T}y\]

Setting \(\nabla{J(\theta)} = 0\) gives the normal equation \(X^{T}X\theta = X^{T}y\). If \(X^{T}X\) is invertible:

\[\theta = (X^{T}X)^{-1}X^{T}y\]

Otherwise the normal equation still has a solution, since \(X^{T}X\) and \(X\) share the same null space:

\[X^{T}X\theta=0 \Rightarrow \theta^{T}X^{T}X\theta = (X\theta)^{T}X\theta=0 \Rightarrow X\theta = 0\]
\[\mbox{null}(X^{T}X) = \mbox{null}(X) \Rightarrow \mbox{range}(X^{T}X) = \mbox{range}(X^{T})\]

Hence \(X^{T}y \in \mbox{range}(X^{T}) = \mbox{range}(X^{T}X)\), so \(X^{T}X\theta = X^{T}y\) is solvable even in the non-invertible case.
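Both cases can be handled numerically; a sketch with synthetic data (np.linalg.lstsq also covers the rank-deficient case by returning a minimum-norm least-squares solution):

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=100)

# Invertible case: solve the normal equation X^T X theta = X^T y directly
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# General case: least-squares solver (works even if X^T X is singular)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))   # True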

Examples#

from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version

X, y = load_boston(return_X_y=True)
X.shape, y.shape
((506, 13), (506,))
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X, y)
LinearRegression()
from sklearn.metrics import mean_squared_error
mean_squared_error(y, reg.predict(X))
21.894831181729202
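Since load_boston was removed in scikit-learn 1.2, the same steps can be run on another built-in regression dataset such as load_diabetes (the shapes and the resulting MSE will of course differ from the numbers above):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X, y)
mean_squared_error(y, reg.predict(X))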

Geometric Interpretation#

Denote the linear space \(S = \mbox{span}\left \{\mbox{columns of } X \right \}\); every element of \(S\) can be written as \(X\theta\) for some \(\theta\).

\(X\theta\) is the projection of \(y\) onto \(S \Leftrightarrow X\theta - y\) is orthogonal to \(S \Leftrightarrow X\theta - y\) is orthogonal to every column of \(X \Leftrightarrow X^{T}(X\theta - y)=0\), which is exactly the normal equation.

Linear regression \(\Leftrightarrow\) finding the projection of \(y\) onto \(S\).
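A quick numerical check of this orthogonality condition on synthetic data (a sketch, assuming a least-squares fit via np.linalg.lstsq):

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(50, 3)), np.ones((50, 1))])
y = rng.normal(size=50)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
residual = X @ theta - y

# The residual is orthogonal to every column of X, i.e. X^T (X theta - y) = 0
print(np.allclose(X.T @ residual, 0.0))         # True up to floating-point error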

Probabilistic Interpretation#

Assume the targets and inputs are related via:

\[y^{(i)} = \theta^{T}x^{(i)} + \epsilon^{(i)}\]

where the error terms \(\epsilon^{(i)}\) are IID Gaussian with mean \(0\) and variance \(\sigma^{2}\):

\[p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left (-\frac{(\epsilon^{(i)})^{2}}{2\sigma^{2}}\right )\]

This is equivalent to saying (note that \(\theta\) is not a random variable here):

\[p(y^{(i)}|x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left ( -\frac{(y^{(i)} - \theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right)\]

The likelihood function:

\[\begin{split} \begin{equation} \begin{split} L(\theta) &= \prod_{i=1}^{n}p(y^{(i)}|x^{(i)}; \theta) \\ &= \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left ( -\frac{(y^{(i)} - \theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right) \end{split} \end{equation} \end{split}\]

Maximize the log likelihood:

\[\begin{split} \begin{equation} \begin{split} \log(L(\theta)) &= \log\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left ( -\frac{(y^{(i)} - \theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right) \\ &= \sum_{i=1}^{n}\log\frac{1}{\sqrt{2\pi}\sigma}\exp\left ( -\frac{(y^{(i)} - \theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right) \\ &= n\log\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(y^{(i)} - \theta^{T}x^{(i)})^{2} \end{split} \end{equation} \end{split}\]

Hence, maximizing the log likelihood gives the same answer as minimizing:

\[\frac{1}{2}\sum_{i=1}^{n}(y^{(i)} - \theta^{T}x^{(i)})^{2} = J(\theta)\]

Linear regression \(\Leftrightarrow\) maximum likelihood estimation under Gaussian errors.
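A small numerical sketch of this equivalence (the toy data and \(\sigma\) are assumptions): for any fixed \(\sigma\), the \(\theta\) that minimizes \(J(\theta)\) also maximizes the Gaussian log likelihood.

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = X @ np.array([1.5, -0.7, 0.3]) + 0.1 * rng.normal(size=200)

def log_likelihood(theta, sigma=0.1):
    # log L(theta) = n * log(1 / (sqrt(2 pi) sigma)) - sum_i (y_i - theta^T x_i)^2 / (2 sigma^2)
    resid = y - X @ theta
    return len(y) * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) - (resid @ resid) / (2 * sigma**2)

def squared_loss(theta):
    resid = y - X @ theta
    return 0.5 * resid @ resid

theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizer of J(theta)
theta_other = theta_ls + 0.05                      # an arbitrary perturbation

print(log_likelihood(theta_ls) > log_likelihood(theta_other))   # True
print(squared_loss(theta_ls) < squared_loss(theta_other))       # True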