DRAFT! This post is part of a series that is still work in progress and will be updated continuously.
In Part 1 of this series, we introduced Linear Regression from the frequentist perspective, where the parameters of a linear model are assumed to be fixed but unknown. We estimated them by minimizing the sum of squared errors, which, under Gaussian noise, is equivalent to maximizing the likelihood of the data: Maximum Likelihood Estimation (MLE). The result is a point estimate of the parameters, which does not account for the uncertainty in the estimates. In this post, we will introduce Bayesian Linear Regression, where we estimate the parameters of a linear model while incorporating prior beliefs about them. We will also discuss how to make predictions using the posterior distribution of the parameters.
Note: This post is based on the book “Pattern Recognition and Machine Learning” by Christopher M. Bishop [1] and slides from the lecture “Advanced Machine Learning” by Prof. Dr. Bastian Leibe [2].
why
When working with complex models, we quickly saw that linear regression tends to overfit the data. This is because the model is too flexible and can fit the noise in the data. We can use regularization techniques like Ridge and Lasso regression to prevent overfitting. However, these methods do not provide a measure of uncertainty in the estimates. We would like to achieve the following goals:
- Work with complex models.
- Prevent overfitting.
- Avoid the need for validation on separate test data.
- Obtain uncertainty estimates in the predictions.
To get there, we have to understand what it actually means to do linear regression and regularization.
probability rules
Let’s review some basic rules of probability theory.
- Sum Rule (or marginalization): $P(X) = \sum_Y P(X, Y)$
- Product Rule (or chain rule): $P(X, Y) = P(X|Y)P(Y)$
From those, we can derive Bayes’ Rule: $$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$$ where the denominator follows from combining the sum and product rules: $$P(X)=\sum_Y P(X|Y)P(Y)$$
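To make the three rules concrete, here is a minimal sketch on a small discrete joint distribution. The table values and the names `joint`, `marginal_x`, `marginal_y`, `cond_x_given_y` and `bayes_y_given_x` are made up for illustration and do not come from the original post.

```python
# Hypothetical joint distribution P(X, Y) over two binary variables.
# The numbers are illustrative assumptions; they only need to sum to 1.
joint = {
    ("x0", "y0"): 0.30,
    ("x0", "y1"): 0.10,
    ("x1", "y0"): 0.20,
    ("x1", "y1"): 0.40,
}


def marginal_x(x):
    """Sum rule: P(X) = sum over Y of P(X, Y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """Sum rule applied to the other variable: P(Y) = sum over X of P(X, Y)."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def cond_x_given_y(x, y):
    """Product rule rearranged: P(X|Y) = P(X, Y) / P(Y)."""
    return joint[(x, y)] / marginal_y(y)


def bayes_y_given_x(y, x):
    """Bayes' rule: P(Y|X) = P(X|Y) P(Y) / P(X)."""
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)


# Bayes' rule must agree with conditioning the joint table directly.
posterior = bayes_y_given_x("y1", "x1")
direct = joint[("x1", "y1")] / marginal_x("x1")
print(posterior, direct)
```

Both printed values should agree up to floating-point error, since Bayes' rule is only a rearrangement of the sum and product rules.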
Using these, we get the Bayes Decision Theory, which is the foundation of Bayesian Learning. To understand it, let’s consider a simple example.
In the visualization above, we have two different classes, $C_1$ and $C_2$, and a measurement $x$.
- Likelihood: The probability of observing the data given that it belongs to class $C$. Both class likelihoods are Gaussians with the same standard deviation but different means.
- Likelihood × Prior: The product of the likelihood and the prior (prior beliefs about the classes), in our case 0.7 for $C_1$ and 0.3 for $C_2$.
- Posterior: The probability of the class given the data, which is the normalized product of the likelihood and the prior.
The green line shows the decision boundary, which is the point where the posterior probabilities of the two classes are equal. If the posterior probability of class $C_1$ is greater than that of class $C_2$, we predict class $C_1$, and vice versa. The distributions shown above are Gaussian (or Normal) distributions, which have the parameters $\mu$ (mean) and $\sigma^2$ (variance) and the following form:
$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$
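The two-class example can be sketched in a few lines. The priors 0.7 and 0.3 are taken from the text. The class means, the shared variance, and all names (`MU`, `SIGMA2`, `PRIOR`, `posteriors`, `decide`, `boundary`) are assumptions chosen for illustration, since the post does not state the values used in the figure.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma2), with sigma2 the variance."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


MU = {"C1": 0.0, "C2": 3.0}     # assumed class means (not given in the post)
SIGMA2 = 1.0                    # assumed shared variance
PRIOR = {"C1": 0.7, "C2": 0.3}  # priors as stated in the text


def posteriors(x):
    """Posterior P(C|x): likelihood times prior, normalized by the evidence."""
    unnormalized = {c: gaussian_pdf(x, MU[c], SIGMA2) * PRIOR[c] for c in MU}
    evidence = sum(unnormalized.values())
    return {c: value / evidence for c, value in unnormalized.items()}


def decide(x):
    """Pick the class with the larger posterior probability."""
    post = posteriors(x)
    return max(post, key=post.get)


# With equal variances the boundary has a closed form: the midpoint of the
# means, shifted towards the class with the smaller prior.
boundary = (MU["C1"] + MU["C2"]) / 2.0 + SIGMA2 * math.log(PRIOR["C1"] / PRIOR["C2"]) / (
    MU["C2"] - MU["C1"]
)

print(boundary, posteriors(boundary))
```

At `boundary` both posteriors should come out as 0.5. The closed-form expression only holds under the equal-variance assumption made here; with different variances the boundary is generally no longer a single point.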
There is also the multidimensional case, which uses a mean vector $\boldsymbol{\mu}$ and a covariance matrix $\boldsymbol{\Sigma}$ instead:
$$ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) $$
Often it makes sense to use the inverse of the covariance matrix, which is called the precision matrix $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$. We can therefore write the Gaussian as:
$$ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}) = \frac{|\boldsymbol{\Lambda}|^{1/2}}{(2\pi)^{D/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Lambda} (\mathbf{x} - \boldsymbol{\mu})\right) $$
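As a quick sanity check, the covariance and precision parameterizations should give the same density value. The sketch below is restricted to $D = 2$ so that it needs no linear algebra library; the helper names (`det2`, `inv2`, `quad_form`, `pdf_cov`, `pdf_prec`) and the example numbers are assumptions for illustration.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def quad_form(v, m):
    """Quadratic form v^T m v for a 2-vector v."""
    return sum(v[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def pdf_cov(x, mu, cov):
    """Bivariate Gaussian density, parameterized by the covariance matrix."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # (2*pi)^(D/2) equals 2*pi for D = 2.
    norm = 1.0 / (2.0 * math.pi * math.sqrt(det2(cov)))
    return norm * math.exp(-0.5 * quad_form(diff, inv2(cov)))


def pdf_prec(x, mu, prec):
    """Same density, parameterized by the precision matrix (inverse covariance)."""
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # |Lambda| = 1 / |Sigma|, so the determinant moves into the numerator.
    norm = math.sqrt(det2(prec)) / (2.0 * math.pi)
    return norm * math.exp(-0.5 * quad_form(diff, prec))


# Illustrative values (assumptions, not from the post).
mu = [1.0, -1.0]
cov = [[2.0, 0.6], [0.6, 1.0]]
x = [0.5, 0.0]

print(pdf_cov(x, mu, cov), pdf_prec(x, mu, inv2(cov)))
```

The precision form avoids the matrix inversion inside the exponent, which is one reason it is often more convenient in derivations.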
Using the probability rules and the Gaussian distribution, we can revisit MLE and the computation of the likelihood. The probability of a single data point is $p(x_n | \theta)$. We assume that all data points are independent and identically distributed (i.i.d.), which allows us to write the likelihood as:
$$ L(\theta) = p(\mathbf{X} | \theta) = \prod_{n=1}^N p(x_n | \theta) $$
The negative log-likelihood is then:
$$ E(\theta) = -\log L(\theta) = -\sum_{n=1}^N \log p(x_n | \theta) $$
To estimate the parameters, we once again try to maximize the likelihood (MLE), but this time by minimizing the negative log-likelihood. How do we minimize? If you read part 1, you know the answer: Always take the derivative and set it to zero.
$$ \begin{aligned} \frac{\partial}{\partial \theta} E(\theta) &= -\frac{\partial}{\partial \theta} \sum_{n=1}^{N} \log p(x_n|\theta) \\ &= -\sum_{n=1}^{N} \frac{\frac{\partial}{\partial \theta} p(x_n|\theta)}{p(x_n|\theta)} \stackrel{!}{=} 0 \end{aligned} $$
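For a concrete instance of this recipe, consider a univariate Gaussian with known variance, where $\theta$ is just the mean. Setting the derivative of the negative log-likelihood to zero then yields the sample mean. The data values and function names below are hypothetical and serve only as a sketch.

```python
import math

data = [2.1, 1.7, 2.9, 2.4, 1.9]  # hypothetical observations
SIGMA2 = 1.0                      # variance assumed known


def log_gaussian(x, mu, sigma2):
    """Log density of N(x | mu, sigma2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (x - mu) ** 2 / (2.0 * sigma2)


def neg_log_likelihood(mu):
    """E(mu) = -sum_n log p(x_n | mu) under the i.i.d. assumption."""
    return -sum(log_gaussian(x, mu, SIGMA2) for x in data)


def grad_neg_log_likelihood(mu):
    """dE/dmu = -sum_n (x_n - mu) / sigma2."""
    return -sum((x - mu) / SIGMA2 for x in data)


# Setting the gradient to zero gives mu = (1/N) sum_n x_n, the sample mean.
mu_mle = sum(data) / len(data)

print(mu_mle, grad_neg_log_likelihood(mu_mle))
print(neg_log_likelihood(mu_mle), neg_log_likelihood(mu_mle + 0.1))
```

The gradient at `mu_mle` should vanish up to rounding, and moving away from `mu_mle` in either direction should increase the negative log-likelihood.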
bayesian learning approach
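In the Bayesian view, we consider the parameter vector $\theta$ to be a random variable. Instead of committing to a single estimate of $\theta$, we compute the distribution of a new data point $x$ given the observed data set $X$ by marginalizing over $\theta$: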
$$ p(x|X) = \int p(x, \theta | X) d\theta $$
The term $p(x, \theta | X)$ is equal to $p(x | \theta, X) p(\theta | X)$ by the product rule. We assume that, given $\theta$, the new point $x$ no longer depends on $X$, so $p(x | \theta, X) = p(x | \theta)$. Therefore, we can write:
$$p(x|X) = \int p(x|\theta) p(\theta|X) d\theta$$
Using Bayes’ rule, we can write the posterior probability $p(\theta|X)$ as follows:
$$ p(\theta|X) = \frac{p(X|\theta) p(\theta)}{p(X)} = \frac{p(\theta)}{p(X)}L(\theta) $$
Using the sum rule for the denominator $p(X)$, we get:
$$ p(X) = \int p(X|\theta) p(\theta) d\theta = \int L(\theta) p(\theta) d\theta $$
Inserting both into $p(x|X)$ above, we get:
$$ \orange{p(x|X)} = \int \frac{\blue{p(x|\theta)} \pink{L(\theta)} \purple{p(\theta)}} {\green{\int L(\theta) p(\theta) d\theta}} d\theta $$
Now we have the following:
- $\orange{p(x|X)}$: The predictive distribution of a new data point $x$ given the data set $X$.
- $\blue{p(x|\theta)}$: Estimate for $x$ based on parametric form $\theta$.
- $\pink{L(\theta)}$: Likelihood of the parametric form $\theta$ given the data set $X$.
- $\purple{p(\theta)}$: Prior probability of the parametric form $\theta$.
- $\green{\int L(\theta) p(\theta) d\theta}$: Normalization term integrating over all possible values of $\theta$.
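The formula can be evaluated numerically when $\theta$ is one-dimensional. The sketch below assumes a Gaussian likelihood with known variance and a Gaussian prior on the mean (both choices, the data, and all names are assumptions for illustration), approximates both integrals with a midpoint rule on a grid, and compares the result with the known closed-form predictive distribution for this conjugate case.

```python
import math


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


data = [2.1, 1.7, 2.9, 2.4, 1.9]  # hypothetical data set X
SIGMA2 = 1.0                      # known noise variance (assumption)
PRIOR_MU, PRIOR_VAR = 0.0, 4.0    # Gaussian prior p(theta) (assumption)


def likelihood(theta):
    """L(theta) = prod_n p(x_n | theta)."""
    return math.prod(gaussian_pdf(x, theta, SIGMA2) for x in data)


def prior(theta):
    return gaussian_pdf(theta, PRIOR_MU, PRIOR_VAR)


# Midpoint-rule grid over theta; the range is wide enough that the
# integrand is negligible outside it for these values.
LO, HI, STEPS = -10.0, 10.0, 4000
DTHETA = (HI - LO) / STEPS
GRID = [LO + (i + 0.5) * DTHETA for i in range(STEPS)]

# Normalization term: integral of L(theta) p(theta) d theta.
EVIDENCE = sum(likelihood(t) * prior(t) for t in GRID) * DTHETA


def predictive(x_new):
    """p(x|X) = integral of p(x|theta) L(theta) p(theta) d theta, divided by the evidence."""
    numerator = sum(gaussian_pdf(x_new, t, SIGMA2) * likelihood(t) * prior(t) for t in GRID)
    return numerator * DTHETA / EVIDENCE


def predictive_closed_form(x_new):
    """Closed form for the Gaussian-Gaussian case: N(x | mu_N, sigma2 + var_N)."""
    post_var = 1.0 / (1.0 / PRIOR_VAR + len(data) / SIGMA2)
    post_mean = post_var * (PRIOR_MU / PRIOR_VAR + sum(data) / SIGMA2)
    return gaussian_pdf(x_new, post_mean, SIGMA2 + post_var)


print(predictive(2.0), predictive_closed_form(2.0))
```

Grid integration like this only works for very low-dimensional $\theta$, because the number of grid points grows exponentially with the dimension.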
results
→ The probability $p(\theta|X)$ makes the dependency of the estimate on the data explicit.
→ If $p(\theta|X)$ is very small everywhere but large for one $\tilde{\theta}$, then $p(x|X) \approx p(x|\tilde{\theta})$, and we effectively recover a point estimate. The more uncertain we are about $\theta$, the more we average over all possible values of the parameters.
Generally speaking, the integral over all possible values of $\theta$ has no closed form for complex models and is computationally infeasible to evaluate exactly. In practice, it can only be approximated stochastically, for example by sampling.
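A stochastic approximation replaces the integral with an average over samples $\theta_s \sim p(\theta|X)$, so that $p(x|X) \approx \frac{1}{S}\sum_s p(x|\theta_s)$. The sketch below reuses a Gaussian likelihood with known variance and a Gaussian prior, so the posterior is itself Gaussian and can be sampled directly. That is an assumption made to keep the example short; for complex models one would typically need a sampler such as MCMC instead. All names and values are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


data = [2.1, 1.7, 2.9, 2.4, 1.9]  # hypothetical data set X
SIGMA2 = 1.0                      # known noise variance (assumption)
PRIOR_MU, PRIOR_VAR = 0.0, 4.0    # Gaussian prior on theta (assumption)

# Closed-form Gaussian posterior p(theta | X) for this conjugate setup.
post_var = 1.0 / (1.0 / PRIOR_VAR + len(data) / SIGMA2)
post_mean = post_var * (PRIOR_MU / PRIOR_VAR + sum(data) / SIGMA2)


def predictive_monte_carlo(x_new, num_samples=20000, seed=0):
    """Estimate p(x|X) by averaging p(x|theta) over posterior samples of theta."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(num_samples):
        theta = rng.gauss(post_mean, math.sqrt(post_var))
        total += gaussian_pdf(x_new, theta, SIGMA2)
    return total / num_samples


# Exact predictive density for comparison: N(x | post_mean, SIGMA2 + post_var).
exact = gaussian_pdf(2.0, post_mean, SIGMA2 + post_var)
estimate = predictive_monte_carlo(2.0)

print(estimate, exact)
```

The estimate should approach the exact value as `num_samples` grows, with an error that shrinks roughly like one over the square root of the number of samples.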
conclusion
We have revisited MLE from a probability theory perspective, but still in terms of error minimization. In the next part, we will express our uncertainty over the target variable using a probability distribution $p(t|x, \mathbf{w}, \beta)$.