Last lecture we looked at simple linear regression ($Y_i=\beta_0 + \beta_1x_i + \epsilon_i$)
\begin{center} \begin{tikzpicture} \begin{axis}[ xmin=50, xmax=100, xlabel=temperature, ymin=100, ymax=250, ylabel=yield, samples=400, axis x line=bottom, axis y line=left, domain=50:100, ] \addplot[mark=x,only marks, blue] coordinates { (50,120) (53,115) (54,125) (55,119) (56,120) (59,140) (62,145) (64,143) (67,147) (71,157) (72,160) (74,175) (75,159) (76,177) (79,180) (80,185) (82,182) (85,185) (87,188) (89,200) (93,195) (94,203) (95,204) (97,212) }; \end{axis} \end{tikzpicture} \end{center}
This lecture we'll look at multiple linear regression (more than one predictor).
\input{./TeX_files/materials/multiple_linear_regression}
\subsection{Structure of multiple linear regression models}
Good news: simple linear models are a special case of \begriff{multiple linear regression}, and with minor extensions everything we've looked at so far still applies. For $p$ predictors and data tuples $\{(x_{1i},x_{2i},...,x_{pi},Y_i)\mid i=1,...,n\}$, we assume the relationship
\begin{align} Y_i = \beta_0 + \sum_{j=1}^{p} \beta_jx_{ji} + \epsilon_i\notag \end{align}
As before, we assume $\epsilon_i\overset{i.i.d}{\sim}Normal(0,\sigma)$. The matrix notation, $Y=X\beta + \epsilon$, now becomes very convenient:
\begin{align} Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}\quad X = \begin{pmatrix} 1 & x_{11} & \dots & x_{p1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1n} & \dots & x_{pn} \end{pmatrix} \quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix}\quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \notag \end{align}
As before, the random variable notation is
\begin{align} Y_i &\overset{i.i.d}{\sim} Normal\left(\beta_0 + \sum_{j=1}^{p} \beta_jx_{ji},\sigma\right) \notag \\ Y &\sim Normal(X\beta,\mathbbm{1}\sigma) \notag \end{align}
Model fitting using Maximum Likelihood Estimation (MLE) proceeds in the same way as for simple linear models and is equivalent to Ordinary Least Squares (OLS) fitting. The likelihood function is given by
\begin{align} L(\beta\mid X) = \prod_{i=1}^n f_{Normal}\left(Y_i,\mu=\beta_0+\sum_{j=1}^{p} \beta_jx_{ji},\sigma\right)\notag \end{align}
where $f_{Normal}$ is the PDF of the normal distribution evaluated at $Y_i$. The exact equations for the parameter MLEs are
\begin{align} \label{MLR_parameter_estimates} \begin{split} \hat{\beta} &= (X^TX)^{-1}X^TY \\ \hat{\sigma}^2 &= \frac{1}{n-(p+1)}(Y-X\hat{\beta})^T(Y-X\hat{\beta}) \end{split} \end{align}
Fitted values for the response are $\hat{Y} = X\hat{\beta}$. The parameter estimate for an explanatory variable describes the relationship between that variable and the response when all other explanatory variables are held fixed.
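The closed-form estimates in \cref{MLR_parameter_estimates} are easy to compute directly. The following is a minimal Python (NumPy) sketch on simulated data; the data, variable names and dimensions are purely illustrative and not taken from the lecture:
\begin{verbatim}
import numpy as np

# Illustrative simulated data: n observations, p = 2 predictors.
rng = np.random.default_rng(1)
n, p = 50, 2
x = rng.uniform(size=(n, p))
y = 1.0 + x @ np.array([2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Design matrix X with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x])

# MLE / OLS estimates: beta_hat = (X^T X)^{-1} X^T Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Fitted values and the residual-based estimate of sigma^2
y_hat = X @ beta_hat
resid = y - y_hat
sigma2_hat = resid @ resid / (n - (p + 1))
\end{verbatim}
In practice, software solves the least-squares problem via a QR or singular value decomposition (e.g.\ \texttt{np.linalg.lstsq}) rather than inverting $X^TX$ explicitly, which is numerically more stable; the explicit formula is used here only to mirror \cref{MLR_parameter_estimates}.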
\subsection{Assumptions of linear models}
The assumptions for multiple linear regression models are the same as for simple linear models:
\begin{itemize}
\item \textbf{Linearity:} the response variable is a linear combination of the explanatory variables.
\item \textbf{Normality:} the errors follow a normal distribution.
\item \textbf{Homoscedasticity:} the variance of the response variable (or errors) is constant.
\item \textbf{Independence:} the errors are uncorrelated (ideally statistically independent).
\item \textbf{Weak exogeneity:} the explanatory variables can be treated as fixed values, rather than random variables.
\end{itemize}
However, there is one important addition:
\begin{itemize}
\item \textbf{Lack of perfect multicollinearity:} if two explanatory variables are perfectly correlated, we cannot solve the equation for the parameter estimates (\cref{MLR_parameter_estimates}). Some (but not perfect!) correlation between explanatory variables may be permissible.
\end{itemize}
Model checking works in the same way as for simple linear models, using residuals. Additionally, we may want to plot each explanatory variable against the residuals.
\subsection{Hypothesis testing on linear model parameters}
Using a conceptually similar approach to the one used for simple linear models, we can test hypotheses about the $\beta$s and construct confidence intervals for them. Importantly, the test statistics depend on the entries of the matrix
\begin{align} (X^TX)^{-1} = \begin{pmatrix} c_{00} & c_{01} & \dots & c_{0p} \\ \vdots & \vdots & \vdots & \vdots \\ c_{p0} & c_{p1} & \dots & c_{pp} \end{pmatrix}\notag \end{align}
as $\sigma_{\hat{\beta_j}}=\hat{\sigma}\cdot\sqrt{c_{jj}}$ ($j=0,...,p$). Off-diagonal entries determine the covariance of the estimates and are important for the variance in prediction. Generally, $\sigma$ needs to be estimated, so in practice we use \person{Student}'s t-distribution and the test statistic
\begin{align} t = \frac{\hat{\beta_j}-\text{hypothesised value}}{\hat{\sigma}\sqrt{c_{jj}}} \notag \end{align}
\textcolor{red}{Many details omitted - software does the work for us!}
\subsection{Model selection}
With the more general linear model formulation, we can create many different models. How do we choose which model to use?
\begin{example} Consider the driving range of electric vehicles.
\begin{center} \begin{tikzpicture} \begin{axis}[ xbar, xmin=0, xlabel={Kilometers}, axis lines*=left, symbolic y coords={% {Smart EV}, {Focus Electric}, {BMW i3}, {Nissan Leaf}, {Tesla S 850}}, ytick=data, nodes near coords, nodes near coords align={horizontal}, ] \addplot coordinates { (110,{Smart EV}) (122,{Focus Electric}) (130,{BMW i3}) (172,{Nissan Leaf}) (426,{Tesla S 850})}; \end{axis} \end{tikzpicture} \end{center}
Does the range depend on battery size, driving style or ambient temperature? Or on all of these factors?
\end{example}
The right model depends on the purpose: do we want to find all relevant factors, or predict the range on a given day? There is not one correct approach to model selection. What makes a good model depends on what it is designed for. Consequently, there are many different tools and techniques. Broadly, there are three types of model selection approaches:
\begin{itemize}
\item Hypothesis tests on model parameters within models (include/exclude parameters).
\item Measures that describe the quality or goodness of fit of models.
\item Hypothesis tests comparing the fit of entire models (often related to measures from the previous point).
\end{itemize}
Fundamentally, model selection is at the heart of scientific inquiry. Statistical techniques offer one quantitative and rigorous approach. We'll go through a few of the most common statistical techniques.
\subsubsection{Hypothesis tests on model parameters}
We've encountered these already (test $H_0$: $\beta_j=0$). We could determine our model by first fitting a full model that includes all conceivable explanatory variables and then removing the explanatory variables for which we can't reject $H_0$ (a minimal sketch follows the list below).
\textbf{Problems:}
\begin{itemize}
\item Multiple comparisons: conducting many hypothesis tests means that some may be rejected/accepted by chance (there are ways of dealing with this, e.g. the \person{Bonferroni} correction).
\item The model fit changes every time we remove an explanatory variable.
\end{itemize}
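Continuing the NumPy sketch from above (reusing \texttt{X}, \texttt{y}, \texttt{beta\_hat}, \texttt{XtX\_inv}, \texttt{sigma2\_hat}, \texttt{n} and \texttt{p}), the standard errors, t statistics and p-values for $H_0$: $\beta_j=0$ can be computed by hand; this is essentially what the coefficient table of a regression summary reports:
\begin{verbatim}
import numpy as np
from scipy import stats

# Standard errors from the diagonal of (X^T X)^{-1}: sigma_hat * sqrt(c_jj)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# t statistics for H_0: beta_j = 0 and two-sided p-values,
# using Student's t-distribution with n - (p + 1) degrees of freedom
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - (p + 1))
\end{verbatim}
A naive backward elimination would now drop the explanatory variable with the largest p-value above some threshold and refit, repeating until all remaining p-values fall below the threshold, subject to the caveats listed above.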
While these tests are useful, we may want other approaches that allow us to test global hypotheses about models.
\subsubsection{$R^2$ and adjusted $R^2$}
Last lecture we looked at the \begriff{coefficient of determination}, $R^2$ (the proportion of variance in the response explained by the explanatory variables). We could use this to distinguish between models: the closer it is to 1, the better the model.
\textbf{Problem:} $R^2$ always increases when we include more explanatory variables.
\begin{proposition}[\person{Occam}'s Razor by \person{William of Occam}, 1287--1347] \label{occams_razor} Entities are not to be multiplied without necessity (Latin: Non sunt multiplicanda entia sine necessitate.) Basically: prefer simpler models. \end{proposition}
\textbf{Solution:} the adjusted $R^2$, $R^2_a$: unlike $R^2$, it takes into account the sample size $n$ and the number of model parameters.
\textbf{Warning:} I caution against relying on $R^2_a$ or $R^2$. They can only be usefully applied in particular circumstances.
\subsubsection{F-test on linear models}
Suppose we want to test a more general, global hypothesis about a linear model: \\
$H_0$: $\beta_1=\beta_2=\dots=\beta_p=0$ \\
$H_A$: At least one of the $\beta_j\neq 0$ \\
i.e. we compare our specified model to a constant model that only has an intercept $\beta_0$. This is called the \begriff{F-test} or (global) Analysis of Variance. It uses the F-distribution, and the test statistic can be expressed in terms of $R^2$ (we skip the details).
\textbf{Warnings}:
\begin{itemize}
\item The test is very specific (we'll see a more general test in a moment). It asks whether the improvement in model fit is larger than expected under $H_0$.
\item It makes several assumptions (e.g. normally distributed errors), so check that the model assumptions hold.
\item It doesn't tell us whether our model is correct.
\end{itemize}
\subsubsection{Quality measures based on the likelihood}
The techniques so far rely on the variance in the response explained by models. Recall the analogy for MLE of maximising $P(\text{data}\mid \text{parameters})$. We could use the maximum likelihood of models, $\hat{L}$, to compare them: the highest $\hat{L}$ indicates the best model.
\textbf{Warning:} the likelihood faces the same problem as $R^2$: it will always improve when more parameters are added to the model.
\textbf{Solution:} add a penalty for the number of parameters, leading to measures of relative model quality (a sketch computing both criteria follows at the end of this section):
\begin{itemize}
\item \textbf{Akaike Information Criterion:}
\begin{align} \text{AIC} = 2k-2\ln(\hat{L})\notag \end{align}
where $k$ is the number of model parameters, including the intercept, but typically excluding the error variance (as this is included in all models). The model with the smallest AIC is preferred.
\item \textbf{Bayesian Information Criterion:}
\begin{align} \text{BIC} = \ln(n)k - 2\ln(\hat{L})\notag \end{align}
Same idea, but the penalty for parameters is stronger (and based on different assumptions).
\end{itemize}
\textbf{Warning:} candidate models must be fitted to the same response data.
\subsubsection{Likelihood-ratio test for nested models}
The likelihood-ratio test is a much more general version of the F-test. Consider a linear model $M_1$ with parameters $\beta_{M_1}=\{\beta_1,\beta_2,...,\beta_p\}$. The test considers a restriction $M_0$ of $M_1$ where e.g. $\beta_{M_0} = \{\beta_1,\beta_2,...,\beta_q,0,...,0\}$ with $q<p$.
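To make the likelihood-based measures concrete, here is a rough continuation of the earlier NumPy sketch, again purely illustrative: $M_1$ is the full model on the simulated data and $M_0$ arbitrarily drops the second predictor. It computes the maximised Gaussian log-likelihood of each model by hand and from it the AIC, BIC and the likelihood-ratio statistic; the p-value relies on the standard asymptotic result that, under $H_0$, twice the difference in maximised log-likelihoods follows a $\chi^2$ distribution with degrees of freedom equal to the number of restricted parameters.
\begin{verbatim}
import numpy as np
from scipy import stats

def gaussian_loglik(y, X):
    # Maximised log-likelihood of an OLS fit (sigma^2 at its ML value RSS/n)
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return np.log(n) * k - 2 * loglik

# M1: intercept + both predictors; M0: intercept + first predictor only
# (X and y as simulated in the first sketch)
ll1, k1 = gaussian_loglik(y, X), X.shape[1]
ll0, k0 = gaussian_loglik(y, X[:, :2]), 2

print(aic(ll1, k1), aic(ll0, k0))
print(bic(ll1, k1, len(y)), bic(ll0, k0, len(y)))

# Likelihood-ratio test of M0 against M1
lr_stat = 2 * (ll1 - ll0)
p_value = stats.chi2.sf(lr_stat, df=k1 - k0)
\end{verbatim}
Note that $k$ here counts only the $\beta$ coefficients, mirroring the convention stated above; statistical packages sometimes also count the error variance, so AIC and BIC values should only be compared when they are computed in the same way and on the same response data.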