DRAFT! This post is part of a series that is still work in progress and will be updated continuously.
In Part 3 of this series, we revisited curve-fitting from a probabilistic perspective instead of in terms of error minimization. We also discussed how to maximize the likelihood function with respect to the parameters to get the Maximum Likelihood Estimation (MLE) solution, showed that it is equivalent to minimizing the sum of squared errors, and similarly showed that the MAP solution is equivalent to Ridge regression. Now we will evaluate the predictive distribution of the target variable given the input variable and the training data.
Note: This post is based on the book “Pattern Recognition and Machine Learning” by Christopher M. Bishop 1 and slides from the lecture “Advanced Machine Learning” by Prof. Dr. Bastian Leibe 2
predictive distribution
Once again, given are the training set $\boldsymbol{X} = (x_1, x_2, \ldots, x_N)^T$ and corresponding target values $\boldsymbol{t} = (t_1, t_2, \ldots, t_N)^T$. We want to make predictions for the target variable $t$ given a new value of the input variable $x$ based on the training data. The uncertainty over the value of the target variable is expressed by a probability distribution:
$$ p(t|x, \bold{X}, \bold{t}) = \int \blue{p(t|x, \bold{w})} \space \orange{p(\bold{w}|\bold{X}, \bold{t})} \space d\bold{w} $$
The blue term $\blue{p(t|x, \bold{w})}$ is the noise distribution for the target variable. We assume Gaussian noise here:
$$ \blue{p(t|x, \bold{w})} = \mathcal{N}(t|\bold{w}^T\bold{\phi}(x), \beta^{-1}) $$
The orange term $\orange{p(\bold{w}|\bold{X}, \bold{t})}$ is the posterior distribution of the parameters given the training data, which we already derived in Part 3 for the MAP solution. Under those assumptions the posterior is a Gaussian, so the integral can be evaluated analytically:
$$ p(t|x, \bold{X}, \bold{t}) = \mathcal{N}(t \mid m(x), s^2(x)) $$
$$ m(x) = \beta \phi(x)^T \bold{S}_N \sum_{n=1}^N \phi(x_n) t_n $$
$$ s^2(x) = \beta^{-1} + \phi(x)^T \bold{S}_N \phi(x) $$
$$ \bold{S}_N^{-1} = \alpha \bold{I} + \beta \sum_{n=1}^N \phi(x_n) \phi(x_n)^T $$
where $\bold{S}_N$ is the covariance matrix of the posterior distribution over the parameters, $\phi$ is the vector of basis functions (the feature transformation), and $\alpha$ and $\beta$ are the precision hyperparameters of the prior and of the noise distribution, respectively. The predictive distribution is a Gaussian with mean $m(x)$ and variance $s^2(x)$. Inspecting the variance, we see two terms: the first term, $\beta^{-1}$, is the noise on the data and was already present in the maximum likelihood solution; the second term expresses the model uncertainty in the parameters $\bold{w}$. This model uncertainty term is large when the input is far from the training data and small when it is close to the training data. One important difference from the maximum likelihood case is that the uncertainty is no longer constant but varies with the test point $x$.
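To make this concrete, here is a minimal sketch in Python that computes the predictive mean $m(x)$ and variance $s^2(x)$ for a toy dataset. The training data, the polynomial basis, and the values of $\alpha$ and $\beta$ are illustrative assumptions, not fixed by anything above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: noisy samples of sin(2*pi*x) (illustrative, not from the post)
N = 10
X_train = rng.uniform(0, 1, N)
t_train = np.sin(2 * np.pi * X_train) + rng.normal(0, 0.2, N)

alpha, beta = 2e-3, 25.0   # assumed prior precision and noise precision
degree = 9                 # polynomial basis functions phi_j(x) = x^j

def phi(x):
    """Basis function vectors phi(x) = (1, x, ..., x^degree)^T, one row per input."""
    return np.power.outer(np.atleast_1d(x), np.arange(degree + 1))

Phi = phi(X_train)                                          # N x (degree+1) design matrix
S_N_inv = alpha * np.eye(degree + 1) + beta * Phi.T @ Phi   # S_N^{-1}
S_N = np.linalg.inv(S_N_inv)

def predict(x):
    """Predictive mean m(x) and variance s^2(x) for new inputs x."""
    phi_x = phi(x)
    m = beta * phi_x @ S_N @ Phi.T @ t_train                 # m(x) = beta phi(x)^T S_N sum_n phi(x_n) t_n
    s2 = 1.0 / beta + np.sum(phi_x @ S_N * phi_x, axis=1)    # s^2(x) = 1/beta + phi(x)^T S_N phi(x)
    return m, s2

x_test = np.linspace(0, 1, 5)
m, s2 = predict(x_test)
print(np.column_stack([x_test, m, np.sqrt(s2)]))             # input, predictive mean, predictive std
```

Evaluating `predict` on inputs outside the range of the training data should show the model uncertainty term inflating $s^2(x)$, while the noise term $\beta^{-1}$ stays constant.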
loss functions
To actually estimate a function value $y_t$ for a new point $x_t$, we need a loss function.
$$ L: \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R} \\ (t_n, y(\bold{x}_n)) \mapsto L(t_n, y(\bold{x}_n)) $$
The optimal prediction is the prediction that minimizes the expected loss:
$$ \mathbb{E}[L] = \iint L(t, y(\bold{x})) \space p(\bold{x}, t) \space d\bold{x} \space dt $$
As before, let us choose the squared loss function $L(t,y(\bold{x})) = \{y(\bold{x}) - t\}^2$:
$$ \mathbb{E}[L] = \iint \{ y(\mathbf{x}) - t \}^2 \space p(\mathbf{x}, t) \space d\mathbf{x} \space dt $$
$$ \frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})} = 2 \int \{ y(\mathbf{x}) - t \} \space p(\mathbf{x}, t) \space dt \stackrel{!}{=} 0 $$
$$ \iff \int t \space p(\mathbf{x}, t) \space dt = y(\mathbf{x}) \int p(\mathbf{x}, t) \space dt $$
$$ \iff y(\mathbf{x}) = \int t \space \frac{p(\mathbf{x}, t)}{p(\mathbf{x})} \space dt = \int t \space p(t \mid \mathbf{x}) \space dt $$
$$ \blue{\iff y(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]}$$
This means that under squared loss, the optimal prediction is the mean $\blue{\mathbb{E}[t \mid \mathbf{x}]}$ of the predictive distribution $p(t \mid \mathbf{x})$. This is also called mean prediction.
Using this result, we can now compute the optimal prediction for a new point $x_t$ for our generalized linear regression function under squared loss:
$$ y(x) = \int t \mathcal{N}(t|\bold{w}^T\bold{\phi}(x), \beta^{-1}) dt = \blue{\bold{w}^T\bold{\phi}(x)} $$
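As a quick sanity check, the following minimal sketch draws samples from the noise distribution $\mathcal{N}(t \mid \bold{w}^T\bold{\phi}(x), \beta^{-1})$ for an assumed weight vector and basis (both illustrative, not from the post) and confirms that their average approaches the mean prediction $\bold{w}^T\bold{\phi}(x)$:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 25.0                                     # assumed noise precision
w = np.array([0.3, -1.2, 0.8])                  # assumed example weights
phi = lambda x: np.array([1.0, x, x ** 2])      # assumed polynomial basis

x_new = 0.4
mean_pred = w @ phi(x_new)                      # closed-form mean prediction w^T phi(x)
samples = rng.normal(mean_pred, 1 / np.sqrt(beta), size=100_000)
print(mean_pred, samples.mean())                # the Monte Carlo average matches w^T phi(x)
```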
minkowski loss
The squared loss is one of many possible loss functions and is a special case of the Minkowski loss function. The Minkowski loss function is defined as:
$$ L_q(y, t) = |y(\bold{x}) - t|^q $$
$$ \mathbb{E}[L_q] = \iint |y(\bold{x}) - t|^q \space p(\bold{x}, t) \space d\bold{x} \space dt $$
The prediction that minimizes the expected loss depends on the choice of $q$; for
- $q = 2$ the solution is the conditional mean
- $q = 1$ the solution is the conditional median
- $q \to 0$ the solution is the conditional mode
By using the slider for $q$, you can see how the Minkowski loss function changes for different values of $q$.
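To make the first two cases concrete, here is a minimal numerical sketch (the skewed Gamma distribution of $t$ is an illustrative assumption): minimizing the empirical expected loss over a grid of candidate predictions recovers the mean for $q = 2$ and the median for $q = 1$. The mode result for $q \to 0$ is a limiting statement that is hard to reproduce reliably with finite samples, so it is omitted here.

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.gamma(shape=2.0, scale=1.0, size=10_000)   # skewed samples of t (illustrative)

y_grid = np.linspace(0.0, 6.0, 601)                # candidate predictions y

def minkowski_argmin(q):
    """Grid point y minimizing the empirical expected Minkowski loss E[|y - t|^q]."""
    losses = np.abs(y_grid[:, None] - t[None, :]) ** q
    return y_grid[np.argmin(losses.mean(axis=1))]

print("q=2 argmin:", minkowski_argmin(2.0), "  sample mean:  ", t.mean())
print("q=1 argmin:", minkowski_argmin(1.0), "  sample median:", np.median(t))
```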