Last lecture we looked at simple linear regression ($Y_i=\beta_0 + \beta_1x_i + \epsilon_i$)
\begin{center} \begin{tikzpicture} \begin{axis}[ xmin=50, xmax=100, xlabel=temperature, ymin=100, ymax=250, ylabel=yield, samples=400, axis x line=bottom, axis y line=left, domain=50:100, ] \addplot[mark=x,only marks, blue] coordinates { (50,120) (53,115) (54,125) (55,119) (56,120) (59,140) (62,145) (64,143) (67,147) (71,157) (72,160) (74,175) (75,159) (76,177) (79,180) (80,185) (82,182) (85,185) (87,188) (89,200) (93,195) (94,203) (95,204) (97,212) }; \end{axis} \end{tikzpicture} \end{center}
This lecture we'll look at multiple linear regression (more than one predictor).
\input{./TeX_files/materials/multiple_linear_regression}
\subsection{Structure of multiple linear regression models}
Good news: simple linear models are a special case of \begriff{multiple linear regression}, and with minor extensions everything we've looked at so far still applies. For $p$ predictors and data tuples $\{(x_{1i},x_{2i},...,x_{pi},Y_i)\mid i=1,...,n\}$, we assume the relationship
\begin{align} Y_i = \beta_0 + \sum_{j=1}^{p} \beta_jx_{ji} + \epsilon_i\notag \end{align}
As before, we assume $\epsilon_i\overset{i.i.d}{\sim}Normal(0,\sigma)$. The matrix notation, $Y=X\beta + \epsilon$, now becomes very convenient:
\begin{align} Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}\quad X = \begin{pmatrix} 1 & x_{11} & \dots & x_{p1} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1n} & \dots & x_{pn} \end{pmatrix} \quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_p \end{pmatrix}\quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix} \notag \end{align}
As before, the random variable notation is
\begin{align} Y_i &\overset{i.i.d}{\sim} Normal\left(\beta_0 + \sum_{j=1}^{p} \beta_jx_{ji},\sigma\right) \notag \\ Y &\sim Normal(X\beta,\mathbbm{1}\sigma) \notag \end{align}
Model fitting using Maximum Likelihood Estimation (MLE) proceeds in the same way as for simple linear models and is equivalent to Ordinary Least Squares (OLS) fitting. The likelihood function is given by
\begin{align} L(\beta\mid X) = \prod_{i=1}^n f_{Normal}\left(Y_i,\mu=\beta_0+\sum_{j=1}^{p} \beta_jx_{ji},\sigma\right)\notag \end{align}
where $f_{Normal}$ is the PDF of the normal distribution evaluated at $Y_i$. The exact equations for the parameter MLEs are
\begin{align} \label{MLR_parameter_estimates} \begin{split} \hat{\beta} &= (X^TX)^{-1}X^TY \\ \hat{\sigma}^2 &= \frac{1}{n-(p+1)}(Y-X\hat{\beta})^T(Y-X\hat{\beta}) \end{split} \end{align}
Fitted values for the response are $\hat{Y} = X\hat{\beta}$. The parameter estimate for an explanatory variable describes the relationship between that variable and the response when all other explanatory variables are held fixed.
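The closed-form estimates in \cref{MLR_parameter_estimates} are easy to compute directly. The following is a minimal Python (NumPy) sketch on simulated data; the data, variable names and dimensions are purely illustrative and not taken from the lecture:
\begin{verbatim}
import numpy as np

# Illustrative simulated data: n observations, p = 2 predictors.
rng = np.random.default_rng(1)
n, p = 50, 2
x = rng.uniform(size=(n, p))
y = 1.0 + x @ np.array([2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Design matrix X with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x])

# MLE / OLS estimates: beta_hat = (X^T X)^{-1} X^T Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Fitted values and the residual-based estimate of sigma^2
y_hat = X @ beta_hat
resid = y - y_hat
sigma2_hat = resid @ resid / (n - (p + 1))
\end{verbatim}
In practice, software solves the least-squares problem via a QR or singular value decomposition (e.g.\ \texttt{np.linalg.lstsq}) rather than inverting $X^TX$ explicitly, which is numerically more stable; the explicit formula is used here only to mirror \cref{MLR_parameter_estimates}.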
\subsection{Assumptions of linear models}
The assumptions for multiple linear regression models are the same as for simple linear models:
\begin{itemize}
\item \textbf{Linearity:} the response variable is a linear combination of the explanatory variables.
\item \textbf{Normality:} the errors follow a normal distribution.
\item \textbf{Homoscedasticity:} the variance of the response variable (or errors) is constant.
\item \textbf{Independence:} the errors are uncorrelated (ideally statistically independent).
\item \textbf{Weak exogeneity:} the explanatory variables can be treated as fixed values, rather than random variables.
\end{itemize}
However, there is one important addition:
\begin{itemize}
\item \textbf{Lack of perfect multicollinearity:} if two explanatory variables are perfectly correlated, we cannot solve the equation for the parameter estimates (\cref{MLR_parameter_estimates}). Some (but not perfect!) correlation between explanatory variables may be permissible.
\end{itemize}
Model checking works in the same way as for simple linear models, using residuals. Additionally, we may want to plot each explanatory variable against the residuals.
\subsection{Hypothesis testing on linear model parameters}
Using a conceptually similar approach to the one used for simple linear models, we can test hypotheses about the $\beta$s and construct confidence intervals for them. Importantly, the test statistics depend on the entries of the matrix
\begin{align} (X^TX)^{-1} = \begin{pmatrix} c_{00} & c_{01} & \dots & c_{0p} \\ \vdots & \vdots & \vdots & \vdots \\ c_{p0} & c_{p1} & \dots & c_{pp} \end{pmatrix}\notag \end{align}
as $\sigma_{\hat{\beta_j}}=\hat{\sigma}\cdot\sqrt{c_{jj}}$ ($j=0,...,p$). Off-diagonal entries determine the covariance of the estimates and are important for the variance in prediction. Generally, $\sigma$ needs to be estimated, so in practice we use \person{Student}'s t-distribution and the test statistic
\begin{align} t = \frac{\hat{\beta_j}-\text{hypothesised value}}{\hat{\sigma}\sqrt{c_{jj}}} \notag \end{align}
\textcolor{red}{Many details omitted - software does the work for us!}
\subsection{Model selection}
With the more general linear model formulation, we can create many different models. How do we choose which model to use?
\begin{example} Consider the driving range of electric vehicles.
\begin{center} \begin{tikzpicture} \begin{axis}[ xbar, xmin=0, xlabel={Kilometers}, axis lines*=left, symbolic y coords={% {Smart EV}, {Focus Electric}, {BMW i3}, {Nissan Leaf}, {Tesla S 850}}, ytick=data, nodes near coords, nodes near coords align={horizontal}, ] \addplot coordinates { (110,{Smart EV}) (122,{Focus Electric}) (130,{BMW i3}) (172,{Nissan Leaf}) (426,{Tesla S 850})}; \end{axis} \end{tikzpicture} \end{center}
Does the range depend on battery size, driving style or ambient temperature? Or on all of these factors?
\end{example}
The right model depends on the purpose: do we want to find all relevant factors, or predict the range on a given day? There is not one correct approach to model selection. What makes a good model depends on what it is designed for. Consequently, there are many different tools and techniques. Broadly, there are three types of model selection approaches:
\begin{itemize}
\item Hypothesis tests on model parameters within models (include/exclude parameters).
\item Measures that describe the quality or goodness of fit of models.
\item Hypothesis tests comparing the fit of entire models (often related to measures from the previous point).
\end{itemize}
Fundamentally, model selection is at the heart of scientific inquiry. Statistical techniques offer one quantitative and rigorous approach. We'll go through a few of the most common statistical techniques.
\subsubsection{Hypothesis tests on model parameters}
We've encountered these already (test $H_0$: $\beta_j=0$). We could determine our model by first fitting a full model that includes all conceivable explanatory variables and then removing the explanatory variables for which we can't reject $H_0$ (a minimal sketch follows the list below).
\textbf{Problems:}
\begin{itemize}
\item Multiple comparisons: conducting many hypothesis tests means that some may be rejected/accepted by chance (there are ways of dealing with this, e.g. the \person{Bonferroni} correction).
\item The model fit changes every time we remove an explanatory variable.
\end{itemize}
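Continuing the NumPy sketch from above (reusing \texttt{X}, \texttt{y}, \texttt{beta\_hat}, \texttt{XtX\_inv}, \texttt{sigma2\_hat}, \texttt{n} and \texttt{p}), the standard errors, t statistics and p-values for $H_0$: $\beta_j=0$ can be computed by hand; this is essentially what the coefficient table of a regression summary reports:
\begin{verbatim}
import numpy as np
from scipy import stats

# Standard errors from the diagonal of (X^T X)^{-1}: sigma_hat * sqrt(c_jj)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# t statistics for H_0: beta_j = 0 and two-sided p-values,
# using Student's t-distribution with n - (p + 1) degrees of freedom
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - (p + 1))
\end{verbatim}
A naive backward elimination would now drop the explanatory variable with the largest p-value above some threshold and refit, repeating until all remaining p-values fall below the threshold, subject to the caveats listed above.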
While these tests are useful, we may want other approaches that allow us to test global hypotheses about models.
\subsubsection{$R^2$ and adjusted $R^2$}
Last lecture we looked at the \begriff{coefficient of determination}, $R^2$ (the proportion of variance in the response explained by the explanatory variables). We could use this to distinguish between models: the closer it is to 1, the better the model.
\textbf{Problem:} $R^2$ always increases when we include more explanatory variables.
\begin{proposition}[\person{Occam}'s Razor by \person{William of Occam}, 1287--1347] \label{occams_razor} Entities are not to be multiplied without necessity (Latin: Non sunt multiplicanda entia sine necessitate.) Basically: prefer simpler models. \end{proposition}
\textbf{Solution:} the adjusted $R^2$, $R^2_a$: unlike $R^2$, it takes into account the sample size $n$ and the number of model parameters.
\textbf{Warning:} I caution against relying on $R^2_a$ or $R^2$. They can only be usefully applied in particular circumstances.
\subsubsection{F-test on linear models}
Suppose we want to test a more general, global hypothesis about a linear model: \\
$H_0$: $\beta_1=\beta_2=\dots=\beta_p=0$ \\
$H_A$: At least one of the $\beta_j\neq 0$ \\
i.e. we compare our specified model to a constant model that only has an intercept $\beta_0$. This is called the \begriff{F-test} or (global) Analysis of Variance. It uses the F-distribution, and the test statistic can be expressed in terms of $R^2$ (we skip the details).
\textbf{Warnings}:
\begin{itemize}
\item The test is very specific (we'll see a more general test in a moment). It asks whether the improvement in model fit is larger than expected under $H_0$.
\item It makes several assumptions (e.g. normally distributed errors), so check that the model assumptions hold.
\item It doesn't tell us whether our model is correct.
\end{itemize}
\subsubsection{Quality measures based on the likelihood}
The techniques so far rely on the variance in the response explained by models. Recall the analogy for MLE of maximising $P(\text{data}\mid \text{parameters})$. We could use the maximum likelihood of models, $\hat{L}$, to compare them: the highest $\hat{L}$ indicates the best model.
\textbf{Warning:} the likelihood faces the same problem as $R^2$: it will always improve when more parameters are added to the model.
\textbf{Solution:} add a penalty for the number of parameters, leading to measures of relative model quality (a sketch computing both criteria follows at the end of this section):
\begin{itemize}
\item \textbf{Akaike Information Criterion:}
\begin{align} \text{AIC} = 2k-2\ln(\hat{L})\notag \end{align}
where $k$ is the number of model parameters, including the intercept, but typically excluding the error variance (as this is included in all models). The model with the smallest AIC is preferred.
\item \textbf{Bayesian Information Criterion:}
\begin{align} \text{BIC} = \ln(n)k - 2\ln(\hat{L})\notag \end{align}
Same idea, but the penalty for parameters is stronger (and based on different assumptions).
\end{itemize}
\textbf{Warning:} candidate models must be fitted to the same response data.
\subsubsection{Likelihood-ratio test for nested models}
The likelihood-ratio test is a much more general version of the F-test. Consider a linear model $M_1$ with parameters $\beta_{M_1}=\{\beta_1,\beta_2,...,\beta_p\}$. The test considers a restriction $M_0$ of $M_1$ where e.g. $\beta_{M_0} = \{\beta_1,\beta_2,...,\beta_q,0,...,0\}$ with $q<p$.
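To make the likelihood-based measures concrete, here is a rough continuation of the earlier NumPy sketch, again purely illustrative: $M_1$ is the full model on the simulated data and $M_0$ arbitrarily drops the second predictor. It computes the maximised Gaussian log-likelihood of each model by hand and from it the AIC, BIC and the likelihood-ratio statistic; the p-value relies on the standard asymptotic result that, under $H_0$, twice the difference in maximised log-likelihoods follows a $\chi^2$ distribution with degrees of freedom equal to the number of restricted parameters.
\begin{verbatim}
import numpy as np
from scipy import stats

def gaussian_loglik(y, X):
    # Maximised log-likelihood of an OLS fit (sigma^2 at its ML value RSS/n)
    n = len(y)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    return np.log(n) * k - 2 * loglik

# M1: intercept + both predictors; M0: intercept + first predictor only
# (X and y as simulated in the first sketch)
ll1, k1 = gaussian_loglik(y, X), X.shape[1]
ll0, k0 = gaussian_loglik(y, X[:, :2]), 2

print(aic(ll1, k1), aic(ll0, k0))
print(bic(ll1, k1, len(y)), bic(ll0, k0, len(y)))

# Likelihood-ratio test of M0 against M1
lr_stat = 2 * (ll1 - ll0)
p_value = stats.chi2.sf(lr_stat, df=k1 - k0)
\end{verbatim}
Note that $k$ here counts only the $\beta$ coefficients, mirroring the convention stated above; statistical packages sometimes also count the error variance, so AIC and BIC values should only be compared when they are computed in the same way and on the same response data.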