DRAFT! This post is part of a series that is still work in progress and will be updated continuously.

In Part 2 of this series, we used basic probability theory, in particular Bayes' rule, to derive the Bayesian linear regression model. We introduced the concept of prior beliefs about the parameters of a linear model and showed how to incorporate them into the model. We also discussed how to make predictions using the posterior distribution of the parameters. Now we will revisit curve fitting from a probabilistic perspective instead of in terms of error minimization.

Note: This post is based on the book “Pattern Recognition and Machine Learning” by Christopher M. Bishop [1] and on slides from the lecture “Advanced Machine Learning” by Prof. Dr. Bastian Leibe [2].

curve-fitting revisited

Given a training set $\boldsymbol{X} = (x_1, x_2, \ldots, x_N)^T$ and corresponding target values $\boldsymbol{t} = (t_1, t_2, \ldots, t_N)^T$, we want to predict the value of the target variable $t$ for a new value of the input variable $x$. The uncertainty over the value of the target variable is expressed by a probability distribution:

$$p(t|x, \bold{w}, \beta)$$

Further, we assume the following:

  1. The data is independent and identically distributed (i.i.d.).

  2. The target values $t$ are generated by adding noise to the value of the underlying function: $$t = y(x, \bold{w}) + \epsilon$$

  3. The noise $\epsilon$ is zero-mean Gaussian with precision $\beta$, so that $t$ is Gaussian distributed with mean $y(x,\bold{w})$ and precision $\beta$ (see the short data-generating sketch after this list):

$$ p(t|x, \bold{w}, \beta) = \mathcal{N}(t|y(x, \bold{w}), \beta^{-1}) $$
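
To make these assumptions concrete, here is a minimal sketch that generates synthetic data according to this noise model; the sine target function, the sample size, and the value of $\beta$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth function y(x); any smooth function works here.
def y_true(x):
    return np.sin(2 * np.pi * x)

N = 25        # number of training points
beta = 25.0   # noise precision; noise standard deviation = beta^(-1/2) = 0.2

X = rng.uniform(0.0, 1.0, size=N)                             # inputs x_n
t = y_true(X) + rng.normal(0.0, 1.0 / np.sqrt(beta), size=N)  # t_n = y(x_n) + eps_n
```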

With these assumptions, we can now write the likelihood function for the target values $\bold{t}$ given the input values $\bold{X}$ and the parameters $\bold{w}$ and $\beta$:

$$ p(\bold{t}|\bold{X}, \bold{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n|y(x_n, \bold{w}), \beta^{-1}) \\ = \prod_{n=1}^{N} \mathcal{N}(t_n|\bold{w}^T\bold{\phi}(x_n), \beta^{-1}) $$

where $\bold{w}^T\bold{\phi}(x_n)$ is the generalized linear regression function. We want to maximize the likelihood function with respect to the parameters $\bold{w}$ and $\beta$. As in Part 2, we can take the logarithm of the likelihood function and minimize the negative log-likelihood to obtain the Maximum Likelihood Estimation (MLE) solution.
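
Before deriving the closed-form solution, it can help to see the likelihood in code. The sketch below evaluates $\log p(\bold{t}|\bold{X}, \bold{w}, \beta)$ for given parameters; the polynomial feature map `phi` is a hypothetical choice, and because the data are i.i.d., the product of densities becomes a sum of log-densities.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical feature map: polynomial basis functions 1, x, x^2, ..., x^M.
def phi(x, M=3):
    return np.vander(np.atleast_1d(x), M + 1, increasing=True)

# log p(t | X, w, beta) for the model t_n ~ N(w^T phi(x_n), beta^{-1}).
def log_likelihood(w, beta, X, t):
    mean = phi(X) @ w  # y(x_n, w) = w^T phi(x_n) for every data point
    return norm.logpdf(t, loc=mean, scale=1.0 / np.sqrt(beta)).sum()
```

Evaluating this on the synthetic data from above, e.g. `log_likelihood(np.zeros(4), 25.0, X, t)`, returns the single scalar that MLE maximizes.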

So, let us first write down the log-likelihood function:

$$ \log{p(\bold{t}|\bold{X}, \bold{w}, \beta)} = \log{\prod_{n=1}^{N} \mathcal{N}(t_n|y(x_n, \bold{w}), \beta^{-1})} $$

$$ = \sum_{n=1}^{N} \log{\mathcal{N}(t_n|y(x_n, \bold{w}), \beta^{-1})} $$
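
To make the next step explicit, recall the density of the univariate Gaussian with mean $y(x_n, \bold{w})$ and precision $\beta$ (i.e. variance $\beta^{-1}$):

$$ \mathcal{N}(t_n|y(x_n, \bold{w}), \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} \left( t_n - y(x_n, \bold{w}) \right)^2 \right) $$

Taking the logarithm of this density gives the summands in the next line.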

$$ = \sum_{n=1}^{N} \left[ \log \left( \frac{\sqrt{\beta}}{\sqrt{2\pi}} \right) - \frac{\beta}{2} \left( y(\mathbf{x}_n, \mathbf{w}) - t_n \right)^2 \right] $$

$$ = \green{-\frac{\beta}{2}} \sum_{n=1}^{N} \orange{\left( t_n - y(\mathbf{x}_n, \mathbf{w}) \right)^2} \green{+ \frac{N}{2} \log \beta - \frac{N}{2} \log(2\pi)} $$

Remember, with respect to $\bold{w}$ and $\beta$, we want to maximize the likelihood, which is the same as minimizing the negative log-likelihood. As always in this series, we have to set the derivative to zero. When we take the partial derivative w.r.t. $\bold{w}$, the last two terms in green vanish because they do not depend on $\bold{w}$. At the same time, since $\beta > 0$ only scales the objective, we could replace the factor $\frac{\beta}{2}$ in the negative log-likelihood by $\frac{1}{2}$ without changing the minimizer. The remaining orange term is the same as the sum of squared errors in the frequentist approach. So, we can see that the MLE solution for $\bold{w}$ is equivalent to minimizing the sum of squared errors.

In Part 1 we left the reasoning behind the choice of the sum-of-squares error function open, but now we can see that this error function arises as a consequence of the assumption of Gaussian noise on the target values.

maximizing likelihood

Let us go ahead and calculate the partial derivative of the negative log-likelihood w.r.t. $\bold{w}$ to confirm this:

$$ \frac{\partial}{\partial \bold{w}} \left( -\log{p(\bold{t}|\bold{X}, \bold{w}, \beta)} \right) = -\beta \sum_{n=1}^{N} \left( t_n - y(\mathbf{x}_n, \mathbf{w}) \right) \bold{\phi}(\mathbf{x}_n) = 0 $$

Setting the derivative to zero, we get the following equation:

$$ 0 = -\beta \sum_{n=1}^{N} \left( t_n - \bold{w}^T\bold{\phi}(x_n) \right) \bold{\phi}(\mathbf{x}_n) $$

$$ \iff \sum_{n=1}^{N} t_n \bold{\phi}(\mathbf{x}_n) = \left( \sum_{n=1}^{N} \bold{\phi}(\mathbf{x}_n) \bold{\phi}(\mathbf{x}_n)^T \right) \bold{w} $$

$$ \iff \bold{\Phi}^T \bold{t} = \bold{\Phi}^T \bold{\Phi} \bold{w} $$

$$ \iff \bold{w}_{ML} = \left( \bold{\Phi}^T \bold{\Phi} \right)^{-1} \bold{\Phi}^T \bold{t} $$

where $\bold{\Phi}$ is the design matrix (feature matrix) with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$. Now let us determine the precision parameter $\beta$ by taking the derivative of the log-likelihood function, this time w.r.t. $\beta$:

$$ \frac{\partial}{\partial \beta} \log{p(\bold{t}|\bold{X}, \bold{w}, \beta)} = - \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \bold{w}^T\bold{\phi}(x_n) \right)^2 + \frac{N}{2\beta} = 0 $$

$$ \iff \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - \bold{w}^T\bold{\phi}(x_n) \right)^2 $$

→ Least-squares regression is equivalent to Maximum Likelihood under the assumption of Gaussian noise!
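
Continuing the synthetic-data sketch from above (reusing `X` and `t`), a minimal NumPy version of the two ML estimates could look as follows; the polynomial degree `M` is again an arbitrary choice.

```python
import numpy as np

M = 3  # degree of the polynomial basis (hypothetical choice)

# Design matrix Phi with Phi[n, j] = phi_j(x_n); here phi_j(x) = x^j.
Phi = np.vander(X, M + 1, increasing=True)

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed via the normal equations.
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# 1 / beta_ML is the mean squared residual of the ML fit.
residuals = t - Phi @ w_ml
beta_ml = 1.0 / np.mean(residuals ** 2)
```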

predictive distribution

Now that we have found the MLE solution for the parameters $\bold{w}$ and $\beta$, we can make predictions for new input values $\mathbf{x}$ by computing the predictive distribution:

$$ p(t | \bold{x}, \bold{w}_{\text{ML}}, \beta_{\text{ML}}) = \mathcal{N}(t | y(\bold{x}, \bold{w}_{\text{ML}}), \beta_{\text{ML}}^{-1}) $$

which means that, instead of only giving a point estimate, we can now also quantify the uncertainty in our prediction.
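
Continuing the sketch above, the predictive distribution for a new input is a Gaussian whose mean is the ML fit and whose variance is $\beta_{\text{ML}}^{-1}$; the helper below is a hypothetical illustration that assumes the same polynomial features.

```python
import numpy as np

def predict(x_new, w_ml, beta_ml, M=3):
    """Return mean and standard deviation of p(t | x, w_ML, beta_ML)."""
    Phi_new = np.vander(np.atleast_1d(x_new), M + 1, increasing=True)
    mean = Phi_new @ w_ml                             # point prediction y(x, w_ML)
    std = np.full_like(mean, 1.0 / np.sqrt(beta_ml))  # constant predictive std
    return mean, std

# e.g. mean, std = predict(0.5, w_ml, beta_ml)
```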

maximum posterior

Now, moving on to the Bayesian approach, we introduce a prior distribution over the polynomial coefficients $\bold{w}$. For simplicity, we assume a zero-mean Gaussian prior distribution of the following form:

$$ p(\bold{w} | \alpha) = \mathcal{N}(\bold{w} | \bold{0}, \alpha^{-1} \bold{I}) = \left( \frac{\alpha}{2\pi} \right)^{(\frac{M+1}{2})} \exp \left( -\frac{\alpha}{2} \bold{w}^T \bold{w} \right) $$
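
As a quick illustration, the sketch below draws a few weight vectors from this prior; the polynomial degree and the value of $\alpha$ are arbitrary choices. Larger $\alpha$ concentrates the samples more tightly around zero.

```python
import numpy as np

M, alpha = 3, 2.0  # hypothetical polynomial degree and prior precision
rng = np.random.default_rng(1)

# Each row is one weight vector w ~ N(0, alpha^{-1} I) of length M + 1,
# i.e. one plausible polynomial before seeing any data.
w_samples = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=(5, M + 1))
```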

The new hyperparameter $\alpha$ is the precision of the prior: its inverse scales the identity matrix $\bold{I}$ to give the prior covariance $\alpha^{-1}\bold{I}$, so $\alpha$ controls how strongly the prior pulls the weights towards zero. The posterior distribution is then given by:

$$ p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \beta, \alpha) \propto p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) p(\mathbf{w} \mid \alpha) $$

This means that we can now determine $\bold{w}$ by maximizing the posterior distribution - this is called maximum-a-posteriori (MAP) estimation. Once again (as always), we can take the logarithm of the posterior distribution and maximize it by minimizing the negative log-posterior.

$$ -\log p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \beta, \alpha) = \blue{-\log p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)} \orange{- \log p(\mathbf{w} \mid \alpha)} + \text{const} $$

$$ \blue{-\log p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)} = \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \bold{w}^T\bold{\phi}(x_n) \right)^2 \green{- \frac{N}{2} \log \beta + \frac{N}{2} \log(2\pi)} $$

$$ \orange{-\log p(\mathbf{w} \mid \alpha)} = \frac{\alpha}{2} \bold{w}^T \bold{w} \green{- \frac{M+1}{2} \log \alpha + \frac{M+1}{2} \log(2\pi)} $$

The green terms do not depend on $\bold{w}$ and are therefore irrelevant for the optimization. The MAP estimate is then obtained by minimizing the sum of the blue and the orange term, which is, up to a scaling by $\frac{1}{\beta}$, the sum-of-squares error (blue) plus the regularization term with regularization parameter $\lambda = \frac{\alpha}{\beta}$ (orange).

→ Maximizing the posterior distribution is equivalent to minimizing the sum of squared errors with a regularization term!
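
Since multiplying the objective by the positive constant $\frac{1}{\beta}$ does not change its minimizer, the correspondence to ridge regression becomes explicit:

$$ \frac{1}{\beta} \left( \frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \bold{w}^T\bold{\phi}(x_n) \right)^2 + \frac{\alpha}{2} \bold{w}^T \bold{w} \right) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \bold{w}^T\bold{\phi}(x_n) \right)^2 + \frac{\lambda}{2} \bold{w}^T \bold{w}, \qquad \lambda = \frac{\alpha}{\beta} $$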

Solution:

$$ \frac{\partial}{\partial \bold{w}} \left( -\log p(\bold{w} | \bold{X}, \bold{t}, \beta, \alpha) \right) = -\beta \sum_{n=1}^{N} \left( t_n - \bold{w}^T\bold{\phi}(x_n) \right) \bold{\phi}(\mathbf{x}_n) + \alpha \bold{w} = 0 $$

$$ \iff \sum_{n=1}^{N} t_n \bold{\phi}(\mathbf{x}_n) = \left( \sum_{n=1}^{N} \bold{\phi}(\mathbf{x}_n) \bold{\phi}(\mathbf{x}_n)^T \right) \bold{w} + \frac{\alpha}{\beta} \bold{w} $$

$$ \iff \bold{\Phi}^T \bold{t} = \left( \bold{\Phi}^T \bold{\Phi} + \frac{\alpha}{\beta} \bold{I} \right) \bold{w} $$

$$ \iff \bold{w}_{MAP} = \left( \bold{\Phi}^T \bold{\Phi} + \frac{\alpha}{\beta} \bold{I} \right)^{-1} \bold{\Phi}^T \bold{t} $$

→ The effect of regularization is to keep the inverse well-conditioned!
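
In code, the MAP solution is just ridge regression on the design matrix. A minimal sketch, reusing `Phi` and `t` from the maximum-likelihood example above and assuming arbitrary values for $\alpha$ and $\beta$:

```python
import numpy as np

alpha, beta = 5e-3, 25.0   # hypothetical prior and noise precisions
lam = alpha / beta         # effective regularization strength lambda

# w_MAP = (Phi^T Phi + lambda I)^{-1} Phi^T t  -- ridge regression.
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)
```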

results

By revisiting curve-fitting from a probabilistic perspective, we have come to a better understanding of what linear regression actually means. Least-squares regression is equivalent to Maximum Likelihood Estimation under the assumption of Gaussian noise, which allows us to give an uncertainty estimate on the prediction. From [Part 1](/posts/blr-part-1), we know that Maximum Likelihood tends towards overfitting. Ridge regression is equivalent to Maximum-a-Posteriori estimation with a Gaussian prior on the parameters $\bold{w}$, which allows us to control the complexity of the model.

For a fully Bayesian treatment, we would have to integrate over all possible values of $\bold{w}$.

references


  1. Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. ↩︎

  2. Prof. Dr. Bastian Leibe, Advanced Machine Learning, lecture slides, RWTH Aachen University, 2022. ↩︎