Although many are already familiar with linear regression from their studies and it may seem a bit basic at first glance, it is a foundational concept for more advanced topics in machine learning. In this series, we will briefly describe linear regression formally and later go deeper into the subject from a Bayesian perspective.
Note: This post is based on the book “Pattern Recognition and Machine Learning” by Christopher M. Bishop 1 and slides from the lecture “Advanced Machine Learning” by Prof. Dr. Bastian Leibe 2. Both of these sources are great for diving deeper into the topic. All equations are taken from them, much of the explanation is based on them as well, and the visualizations are inspired by them but implemented by me.
To follow along, you should have a basic understanding of linear algebra, calculus, and probability theory (for the later parts). If you are not familiar with these topics, I recommend reading up on them first.
introduction
Given a training set $\bold{X} = \{x_1, …, x_N\}$ and target values $\bold{T} = \{t_1, …, t_N\}$, we want to learn a continuous function $y(x)$ to predict the function value for a new input $x$. To achieve this, we have to do the following:
Choose the Model Form: Select the form of the function $y(x,\bold{w})$ where $\bold{w}$ are the parameters. This defines how the inputs $x$ are mapped to the predicted outputs.
Define the Error Function: Specify an error function $E(\bold{w})$ that measures the difference between the predicted values and the actual target values. This function will guide the optimization process.
Optimize the Parameters: Adjust the parameters $\bold{w}$ to minimize the error function $E(\bold{w})$. The goal is to find the parameter values that result in the best fit of the model to the data.
Analyze the Solution: Evaluate the properties of the obtained solution, including its performance and potential limitations. This analysis helps in understanding the model’s behavior and its applicability to different scenarios.
toy dataset
To illustrate the concepts, we will use a toy dataset generated by a sinusoidal function with Gaussian noise. The function is defined as:
$$ f(x) = \sin(2\pi x) + \epsilon $$
where $\epsilon$ is Gaussian noise with mean $\mu = 0$ and standard deviation $\sigma = 0.1$. The simulated dataset consists of $10$ points sampled from the function. Below, you can inspect the function and adjust the noise level using the slider or resample the points.
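The interactive plot is part of the original post; as a rough stand-in, here is a minimal numpy sketch of how such a dataset could be generated. The function name, the seed, and the assumption that the inputs lie uniformly in $[0, 1]$ (as in Bishop's curve-fitting example) are my own choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def make_dataset(n_points=10, noise_std=0.1):
    """Sample n_points from sin(2*pi*x) plus zero-mean Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n_points)   # inputs in [0, 1] (assumption)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n_points)
    return x, t

x_train, t_train = make_dataset()
```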
steps towards a solution
Our goal is now to fit a polynomial function to this data, where $M$ is the degree of the polynomial and $\bold{w} = \{w_0, …, w_M\}$ are the parameters of the model:
$$ y(x, \bold{w}) = w_0 + w_1x + w_2x^2 + … + w_Mx^M \newline = \sum_{j=0}^{M} w_jx^j $$
Note: the function $y(x, \bold{w})$ is linear in the parameters $\bold{w}$, but does not have to be linear in the input $x$. It is called linear regression because of this linearity in the parameters.
So now that we are done with step 1, we need to define an error function to optimize. The most common error function is the sum of squared errors (SSE) between the predicted values and the target values:
$$ E(\bold{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \bold{w}) - t_n \right)^2 $$
where $N$ is the number of data points. The motivation for this particular choice will be shown later in this series. With the model form chosen and the error function defined, we can move on to step 3: optimization. In our case, we need to minimize the error function $E(\bold{w})$ with respect to the parameters $\bold{w}$. How do we minimize the error? We compute the derivative and set it to zero.
$$ \frac{\partial E(\mathbf{w})}{\partial w_j} = \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right) \frac{\partial y(x_n, \mathbf{w})}{\partial w_j} \newline = \sum_{n=1}^{N} \left( y(x_n, \mathbf{w}) - t_n \right) x_n^j \stackrel{!}{=} 0 $$
Since the error function is a quadratic function of the parameters, the derivative will be linear in $\bold{w}$, which means the solution is unique and can be found in closed form.
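To make steps 1 and 2 concrete before we solve step 3, here is a minimal numpy sketch (the function names are my own) of the polynomial model $y(x, \bold{w})$ and the sum-of-squares error $E(\bold{w})$:

```python
import numpy as np

def predict(x, w):
    """Polynomial model y(x, w) = sum_j w_j * x**j with w = [w_0, ..., w_M]."""
    powers = np.arange(len(w))                     # exponents 0, 1, ..., M
    return np.asarray(x)[..., None] ** powers @ w

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    return 0.5 * np.sum((predict(x, w) - t) ** 2)
```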
least-squares regression
Let’s solve the equation above by trying to enforce $\bold{x}_i^T \bold{w} + w_0 = t_i$ for all $i$, i.e. by making the predicted value match the target value exactly. This gives us one linear equation per data point.
Setup
- Define augmented input vector and weight vector
$$ \tilde{x}_i = \begin{pmatrix} x_i \\ 1 \end{pmatrix}, \quad \tilde{\mathbf{w}} = \begin{pmatrix} \mathbf{w} \\ w_0 \end{pmatrix} $$
- Rewrite
$$ \tilde{x}_i^T \tilde{\mathbf{w}} = t_i, \quad \forall i = 1, \ldots, N $$
- Matrix-vector notation
$$ \mathbf{\tilde{X}}^T \tilde{\mathbf{w}} = \mathbf{t} $$
$$ \mathbf{\tilde{X}} = [\tilde{x}_1, \ldots, \tilde{x}_N], \quad \mathbf{t} = [t_1, \ldots, t_N]^T $$
- Find least-squares solution $$ | \mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t} |^2 \rightarrow \min $$
As I mentioned earlier, when we aim to minimize something (as in this case), we need to compute the derivative and set it to zero. This gives us the following solution:
$$ \tilde{\mathbf{w}} = (\mathbf{\tilde{X}}\mathbf{\tilde{X}}^T)^{-1}\mathbf{\tilde{X}}\mathbf{t} $$
proof
The expression $|\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t}|^2$ represents the squared Euclidean norm (or squared L2 norm) of the vector $\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t}$. This norm essentially measures the total squared distance between the predicted values $\mathbf{\tilde{X}}^T \tilde{\mathbf{w}}$ and the actual values $\mathbf{t}$. Expanding this, the squared norm can be written as:
$$ |\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t}|^2 = (\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t})^T (\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t}) $$
To find the minimum, we differentiate the objective function with respect to $\tilde{\mathbf{w}}$ and set the derivative to zero.
Objective function:
$$ J(\tilde{\mathbf{w}}) = |\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t}|^2 $$
Gradient with respect to $\tilde{\mathbf{w}}$: To minimize $J(\tilde{\mathbf{w}})$, compute the gradient:
$$ \nabla J(\tilde{\mathbf{w}}) = \frac{\partial J(\tilde{\mathbf{w}})}{\partial \tilde{\mathbf{w}}} $$
Using the chain rule and linear algebra properties:
$$ \nabla J(\tilde{\mathbf{w}}) = 2\mathbf{\tilde{X}}(\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t}) $$
Here, the factor of $2$ comes from differentiating the squared norm, and $\mathbf{\tilde{X}}$ comes from the chain rule application to the inner product.
Set the gradient to zero: dividing by $2$, we get
$$ \mathbf{\tilde{X}}(\mathbf{\tilde{X}}^T \tilde{\mathbf{w}} - \mathbf{t}) = 0 $$
Solve for $\tilde{\mathbf{w}}$: $$ \mathbf{\tilde{X}} \mathbf{\tilde{X}}^T \tilde{\mathbf{w}} = \mathbf{\tilde{X}}\mathbf{t} $$
Assuming $\mathbf{\tilde{X}} \mathbf{\tilde{X}}^T$ is invertible, we can solve for $\tilde{\mathbf{w}}$ and get the least-squares solution:
$$ \tilde{\mathbf{w}} = (\mathbf{\tilde{X}}\mathbf{\tilde{X}}^T)^{-1}\mathbf{\tilde{X}}\mathbf{t} $$
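Translating the derivation into code, a minimal sketch of the closed-form solution for the augmented setup above could look as follows (numpy, names my own); rather than forming the inverse explicitly, the normal equations are solved as a linear system:

```python
import numpy as np

def fit_line_least_squares(x, t):
    """Closed-form least squares for y = w*x + w_0.

    Columns of X_tilde are the augmented inputs (x_i, 1)^T; we solve the
    normal equations X_tilde X_tilde^T w_tilde = X_tilde t.
    """
    X_tilde = np.vstack([x, np.ones_like(x)])      # shape (2, N)
    A = X_tilde @ X_tilde.T                        # (2, 2) matrix X_tilde X_tilde^T
    b = X_tilde @ t                                # (2,)  vector X_tilde t
    return np.linalg.solve(A, b)                   # w_tilde = [w, w_0]

# e.g. w, w0 = fit_line_least_squares(x_train, t_train)
```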
model complexity
Okay, we found the unique solution, but so far we are only fitting a polynomial of degree $1$. What if we want to fit a polynomial of degree $M$? We can do this by introducing a feature transformation $\phi(x)$, which maps the input $x$ to a higher-dimensional space. The polynomial regression can then be written as:
$$ y(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) = \sum_{i=0}^{M} w_i \phi_i(\mathbf{x}) $$
We assume $\phi_0(\bold{x}) = 1$ for the bias term. For a cubic polynomial, for example, the feature transformation would be $\boldsymbol{\phi}(\bold{x}) = [1, x, x^2, x^3]^T$. But which order of the polynomial should we pick? To compare different solutions, we need a measure of the error. Using the error function $E(\bold{w})$ directly would be possible, but it has the disadvantage that it scales with the number of data points. To measure the generalization performance on test sets of different sizes, we instead use the root-mean-square (RMS) error, given by:
$$ E_{RMS} = \sqrt{2E(\bold{w}^*)/N} $$
By dividing the error by the number of data points, we get the average error per data point. The square root is taken to express the error in the same units as the target values. The optimal parameters $\bold{w}^*$ are the ones that minimize the error function $E(\bold{w})$. The RMS error is measured both on the training set, which is used to obtain the least-squares solution, and on a test set generated from the same underlying function as before.
So, let’s compare the RMS for different polynomial orders. You can adjust the noise level and the sample size using the sliders and change the order of the polynomial that is fitted to the data. The RMS for the training and test set is displayed below the chart.
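The interactive chart is again part of the original post; as a static counterpart, here is a rough sketch (reusing `make_dataset` from above; the helper names are mine) of how one could fit polynomials of different orders via the feature transform $\boldsymbol{\phi}$ and compare the training and test RMS errors:

```python
import numpy as np

def design_matrix(x, M):
    """Rows are phi(x_n) = [1, x_n, x_n^2, ..., x_n^M]."""
    return np.asarray(x)[:, None] ** np.arange(M + 1)     # shape (N, M+1)

def fit_polynomial(x, t, M):
    """Least-squares weights for a degree-M polynomial."""
    Phi = design_matrix(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w*) / N), i.e. the root-mean-square residual."""
    residuals = design_matrix(x, len(w) - 1) @ w - t
    return np.sqrt(np.mean(residuals ** 2))

x_test, t_test = make_dataset(n_points=100)               # fresh test set from the same function
for M in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    print(f"M={M}: train={rms_error(w, x_train, t_train):.3f}, "
          f"test={rms_error(w, x_test, t_test):.3f}")
```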
observations
- Lower order polynomials are too simple to capture the underlying function, regardless of the sample size → underfitting
- For smaller samples, higher-order polynomials reach zero training error while test error increases → overfitting
- Increasing the sample size reduces the effect of overfitting.
Trivially, every lower-order polynomial is a special case of a higher-order polynomial in which the coefficients of the additional higher-order terms are set to zero. This means that the higher-order polynomial can represent everything the lower-order polynomial can, and more. So why do the higher-order polynomials seemingly perform worse? If we inspect the parameters, we can see that they become very large. This is due to overfitting: the model fits the noise that we added to our toy dataset.
Now let us look at the polynomial of order $9$ again, but this time fitted to data sets of two different sizes. The larger the data set, the less pronounced the overfitting. If we check the parameters again, we can see that the fit on the larger data set has smaller parameter values.
regularization
Since the number of sample points is often something we cannot influence, and we still want to use complex and flexible models that generalize well, we introduce regularization to avoid overfitting. Regularization penalizes large coefficient values by adding a term to the error function that is proportional to the sum of the squared parameters. This term is called the regularization term and is weighted by the regularization parameter $\lambda$. The most common choice is a quadratic regularizer (Ridge Regression), which is given by:
$$ E(\bold{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(x_n, \bold{w}) - t_n \right)^2 + \red{\frac{\lambda}{2} \bold{w}^T \bold{w}} $$
The regularization term is also called the weight decay term, as it encourages the weights to decay towards zero as $\lambda$ increases. Setting the trade-off parameter $\lambda$ to zero recovers the least-squares solution, since the regularization term then has no effect on the error function. By slowly increasing $\lambda$, we can observe the effect of regularization on the model complexity and the RMS error. The optimal value for $\lambda$ can be found by cross-validation.
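Setting the gradient of the regularized error to zero in the same way as before yields the closed-form solution $\bold{w} = (\lambda \bold{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \bold{t}$, where $\boldsymbol{\Phi}$ is the design matrix with rows $\boldsymbol{\phi}(x_n)^T$. A minimal sketch, reusing the `design_matrix` helper from above (the names are again my own):

```python
import numpy as np

def fit_ridge(x, t, M, lam):
    """Regularized least squares: solve (lambda*I + Phi^T Phi) w = Phi^T t."""
    Phi = design_matrix(x, M)                      # (N, M+1)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# lambda = 0 recovers the least-squares fit; larger lambda shrinks the weights
for lam in (0.0, 1e-3, 1.0):
    w = fit_ridge(x_train, t_train, M=9, lam=lam)
    print(f"lambda={lam}: max |w_j| = {np.max(np.abs(w)):.1f}")
```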
Comparing the results for different values of $\lambda$ shows that the regularization term indeed helps to avoid overfitting. The trade-off parameter now controls the effective model complexity and thus the degree of overfitting.
conclusion
We used linear regression to fit a polynomial of different orders to a toy dataset. We observed that higher-order polynomials tend to overfit the data, while lower-order polynomials could be too simple to capture the underlying function. Bigger data sets and regularization techniques like Ridge Regression can help to avoid overfitting and improve the generalization performance of the model.
In the next part of this series, we will look into Probability Theory and Maximum Likelihood Estimation to start our journey towards Bayesian Linear Regression.