Published on March 02, 2023 by Tung Nguyen
Machine Learning · Probability and Statistics
Linear Regression is a fundamental supervised learning approach for modeling the relationship between one or more explanatory variables (often called features or predictors) and a continuous target variable. The goal is to find a function (usually linear) that describes this relationship as accurately as possible.
The simplest form, Simple Linear Regression, deals with one explanatory variable \(x\) and one response variable \(y\), aiming to fit a line:
\[ y \approx \beta_0 + \beta_1 x. \]
More generally, Multiple Linear Regression deals with multiple features/predictors:
\[ \mathbf{y} \approx \mathbf{X}\boldsymbol{\beta}, \]
where \(\mathbf{y} \in \mathbb{R}^n\) is the vector of responses, \(\mathbf{X} \in \mathbb{R}^{n \times d}\) is the design matrix whose rows are the feature vectors, and \(\boldsymbol{\beta} \in \mathbb{R}^d\) is the vector of coefficients.
We assume a linear model of the form
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \]
where \(\boldsymbol{\varepsilon}\) is the noise (error) term, often assumed to be normally distributed with mean \(\mathbf{0}\) and variance \(\sigma^2\):
\[ \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}). \]
Equivalently, each response is distributed as
\[ y_i \sim \mathcal{N}(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2), \]
where \(\mathbf{x}_i\) is the \(i\)-th row of \(\mathbf{X}\).
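To make this generative model concrete, here is a minimal NumPy sketch that simulates data from it; the sample size, number of features, true coefficients, and noise level are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 3                           # sample size and number of features (illustrative)
beta_true = np.array([2.0, -1.0, 0.5])  # "true" coefficients used only for the simulation
sigma = 0.3                             # noise standard deviation

X = rng.normal(size=(n, d))             # design matrix; its rows are the x_i
eps = rng.normal(0.0, sigma, size=n)    # eps ~ N(0, sigma^2 I)
y = X @ beta_true + eps                 # so each y_i ~ N(x_i^T beta, sigma^2)
```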
The likelihood of the observed data under this model is:
\[ p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2}{2\sigma^2}\right). \]
Maximizing the likelihood is equivalent to maximizing the log-likelihood:
\[\ell(\boldsymbol{\beta}, \sigma^2) = \ln p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2)\]
\[= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2.\]
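For reference, the log-likelihood above translates almost directly into NumPy. This is just a sketch (the function name is my own), reusing the `X` and `y` from the simulation snippet:

```python
import numpy as np

def log_likelihood(beta, sigma2, X, y):
    """Gaussian log-likelihood ell(beta, sigma^2) of the linear model."""
    n = len(y)
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)
```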
We typically maximize \(\ell\) w.r.t. \(\boldsymbol{\beta}\) (and can separately solve for \(\sigma^2\)).
Ignoring terms that do not depend on \(\boldsymbol{\beta}\), we want to minimize:
\[\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}).\]
Taking the gradient of this quantity with respect to \(\boldsymbol{\beta}\) and setting it to zero:
\[ \frac{\partial}{\partial \boldsymbol{\beta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = -2 \mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0, \]
which yields
\[ \mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}. \]
If \(\mathbf{X}^T \mathbf{X}\) is invertible (i.e., \(\mathbf{X}\) has full column rank), the maximum likelihood estimate (MLE) is:
\[ \boldsymbol{\hat{\beta}}_{\text{MLE}} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^T \mathbf{y}. \]
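In code, it is usually preferable to solve the normal equations (or call a least-squares routine) rather than form the matrix inverse explicitly. A sketch, continuing with the `X` and `y` from the snippets above:

```python
import numpy as np

# Solve X^T X beta = X^T y instead of inverting X^T X explicitly.
beta_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent result, and more robust when X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```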
To find the MLE for \(\sigma^2\), substitute \(\boldsymbol{\hat{\beta}}_{\text{MLE}}\) into the log-likelihood and maximize with respect to \(\sigma^2\). The result is:
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \bigl(y_i - \mathbf{x}_i^T \boldsymbol{\hat{\beta}}_{\text{MLE}}\bigr)^2. \]
(The unbiased variant divides by \(n-d\), where \(d\) is the number of coefficients, but the standard MLE divides by \(n\).)
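Continuing the same sketch, the noise estimate is just the average squared residual (with `n`, `d`, `X`, `y`, and `beta_mle` defined above):

```python
resid = y - X @ beta_mle
sigma2_mle = (resid @ resid) / n             # standard MLE: divide by n
sigma2_unbiased = (resid @ resid) / (n - d)  # unbiased variant: divide by n - d
```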
Alternatively, the Ordinary Least Squares (OLS) approach seeks to minimize the sum of squared errors (residuals):
\[ \min_{\boldsymbol{\beta}} \; \sum_{i=1}^n (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 = \min_{\boldsymbol{\beta}} \; (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \]
Following exactly the same derivation (taking the derivative w.r.t. \(\boldsymbol{\beta}\) and setting it to zero), we arrive at the normal equations:
\[ \mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}. \]
Thus, the solution is the same:
\[ \boldsymbol{\hat{\beta}}_{\text{OLS}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. \]
Therefore, MLE (under the assumption of normally distributed errors) and OLS yield the same parameter estimates.
When we assume the errors \(\boldsymbol{\varepsilon}\) are i.i.d. Gaussian with zero mean and variance \(\sigma^2\), maximizing the likelihood is identical to minimizing the sum of squared residuals. Consequently, the MLE and OLS solutions coincide.
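As a quick numerical check of this equivalence, one can maximize the log-likelihood with a generic optimizer and compare the result with the closed-form solution. A sketch using SciPy and the variables from the earlier snippets (the fixed value of \(\sigma^2\) does not affect the maximizing \(\boldsymbol{\beta}\)):

```python
import numpy as np
from scipy.optimize import minimize

def neg_ll(beta):
    # Negative log-likelihood as a function of beta alone (sigma^2 fixed at 1.0).
    return -log_likelihood(beta, 1.0, X, y)

# Numerically maximize the likelihood...
res = minimize(neg_ll, x0=np.zeros(X.shape[1]))

# ...and compare with the closed-form solution of the normal equations.
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(res.x, beta_closed, atol=1e-4))  # True, up to optimizer tolerance
```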