Based on the “Statistical Consulting Cheatsheet” by Prof. Kris Sankaran
Linear models provide the basis for most inference in multivariate settings. We won’t even begin to try to cover this topic comprehensively. There are entire course sequences that only cover linear models. But, we’ll try to highlight the main regression-related ideas that are useful to know during consulting.
This section is focused more on the big-picture of linear regression and when we might want to use it in consulting. We defer a discussion of inference in linear models to the next section.
There are three steps in the analysis of data using a (Generalized) Linear Model:
Figure: the three steps in GLM analysis.
Depending on the type of data you have at hand, you might opt for one of the following model families:
Linear regression learns a linear function between covariates and a response, and is popular because there are well-established methods for performing inference for common hypotheses.
Generally, model-fitting procedures suppose that there is a single variable \(Y\) of special interest. The goal of a supervised analysis is to determine the relationship between many other variables (called covariates), \(X_1, \cdots X_p\) and \(Y\) . Having a model \(Y = f(X_1, \cdots X_p)\) can be useful for many reasons, especially (1) improved scientific understanding (the functional form of f is meaningful) and (2) prediction, using \(f\) learned on one set of data to guess the value of \(Y\) on a new collection of \(X\)’s.
Linear models posit that the functional form \(f\) is linear in the \(X_1, \cdots X_p\).
This is interpreted geometrically by saying that the change in \(Y\) that
occurs when \(X_j\) is changed by some amount (and all other covariates are
held fixed) is proportional to the change in \(X_j\) . E.g., when \(p = 1\), this
means the scatterplot of \(X_1\) against \(Y\) can be well-summarized by a line.
A little more formally, we suppose we have observed samples indexed by
\(i\), where \(x_i \in \mathbb{R}^p\) collects the covariates for the \(i\)th sample, and \(y_i\) is the
associated observed response. A regression model tries to find a parameter
\(\beta \in \mathbb{R}^p\) so that
\[y_i = x_i^T \beta + \epsilon_i\]
is plausible, where the \(\epsilon_i\) are drawn i.i.d. from \(N(0, \sigma^2)\) for some (usually unknown) \(\sigma^2\). The fitted value for \(\beta\) after running a linear regression is denoted \(\hat{\beta}\).
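To make this concrete, here is a minimal sketch of fitting such a model by ordinary least squares. The library choice (statsmodels) and all variable names are illustrative assumptions, not part of the cheatsheet.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: two covariates and a noisy linear response.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)

# Fit y_i = x_i^T beta + eps_i by ordinary least squares.
fit = smf.ols("y ~ x1 + x2", data=df).fit()
print(fit.params)     # hat{beta}, including the intercept
print(fit.summary())  # coefficient table with standard errors and p-values
```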
In a consulting setting, regression is useful for understanding the association between two variables while controlling for many others. This is basically a rephrasing of point (1) above, but it’s the essential interpretation of linear regression coefficients, and it’s this interpretation that many research studies are going after.
Sometimes a client might originally come with a testing problem, but might want help extending it to account for additional structure or covariates. In this setting, it can often be useful to propose a linear model instead: it still allows inference, but it becomes much easier to encode more complex structure.
Nonlinear relationships can be captured with linear models. To see why this is potentially useful, suppose we want to use time as a predictor in a model (e.g., where \(Y\) is the number of individuals of species \(j\) present in the sample), but the species population doesn’t just increase or decrease linearly over time (instead, it follows some smooth curve). Here, you can introduce a spline basis associated with time and then use a linear regression of the response onto these basis functions. The fitted coefficients will define a mean function relating time and the response.
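A sketch of this trick, assuming the patsy `bs()` spline transform is available through the statsmodels formula interface; the basis size `df=5` and all variable names are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical response that varies smoothly (not linearly) with time.
rng = np.random.default_rng(1)
df = pd.DataFrame({"time": np.linspace(0, 10, 200)})
df["y"] = np.sin(df["time"]) + rng.normal(scale=0.2, size=200)

# bs(time, df=5) expands time into 5 B-spline basis columns, so the fitted
# mean is a smooth curve even though the model is linear in beta.
fit = smf.ols("y ~ bs(time, df=5)", data=df).fit()
df["fitted"] = fit.fittedvalues  # the estimated mean function over time
```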
Figure 5: In the simplest setting, an interaction between a continuous and a binary variable leads to two different slopes for the continuous variable. Here, we show a scatterplot of the \((x_{i1}, y_i)\) pairs observed in the data. We suppose a binary variable, denoted \(x_{i2}\), has also been measured, and we shade each point according to its value. Apparently, the relationship between \(x_1\) and \(y\) depends on the value of \(x_2\) (in the pink group, the slope is smaller). This can be captured exactly by introducing an interaction term between \(x_1\) and \(x_2\). When \(x_2\) is not binary, we would have a continuum of slopes between \(x_1\) and \(y\), one for each value of \(x_2\).

The fact that you can do regression onto derived variables, not just the ones that were collected originally, might not be obvious to your client. For example, if you were trying to predict whether someone will have a disease based on a time series of some lab tests, you can construct new variables corresponding to the "slope at the beginning," the "slope at the end," the max, the min, and so on across the time series. Of course, deciding which variables might actually be relevant for the regression will depend on domain knowledge. One trick (introducing random effects) is so common that it gets its own section; basically, it's useful whenever you have many levels of a particular categorical variable.
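The interaction described in Figure 5 can be encoded directly in a model formula. A minimal sketch with simulated data (the names `x1`, `group`, and `y` are made up): the `x1 * C(group)` term gives each group its own slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data where the slope in x1 differs between two groups.
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200),
                   "group": rng.integers(0, 2, size=200)})
df["y"] = 1.0 + (0.5 + 1.5 * df["group"]) * df["x1"] + rng.normal(scale=0.3, size=200)

# "x1 * C(group)" expands to x1 + C(group) + x1:C(group): each level of the
# binary variable gets its own intercept and its own slope for x1.
fit = smf.ols("y ~ x1 * C(group)", data=df).fit()
print(fit.params)
```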
How can you assess whether a linear regression model is appropriate? Many types of diagnostics have been proposed, but a few of the most common are:
Plot the residuals: Plot the residuals \(y_i - \hat{y}_i\) against the fitted values (or against individual covariates). If the model is appropriate, they should look like unstructured i.i.d. noise; Figure 7 shows common departures to watch out for.

Figure 7: Some of the most common types of "structure" to watch out for in residual plots are displayed here. The top-left panel shows how residuals should appear: essentially i.i.d. \(N(0, 1)\). The panel below it shows nonconstant variance in the residuals, and the one on the bottom has an extreme outlier. The panel on the top right seems to have means that are far from zero in a structured way, while the one below it has some anomalous discreteness.
Make a qq-plot of residuals: More generally than looking for large outliers in the residuals, we might ask whether the residuals are plausibly drawn from a normal distribution. QQ-plots give one way of checking this; more often than not, the tails are heavier than normal. Most people ignore this, but it can be worthwhile to consider, e.g., robust regression or logistically (instead of normally) distributed errors.
Calculate the leverage of different points: The leverage of a sample is a measure of how much the overall fit would change if you took that point out. Points with very high leverage can be cause for concern (it’s bad if your fit depends entirely on one or two observations), and these high-leverage points often turn out to be outliers. See Figure 8 for an example of this phenomenon. If you find that a few points have very high leverage, you might consider throwing them out. Alternatively, you could use a robust regression method.
Figure 8: The leverage of a sample can be thought of as the amount of influence it has on a fit. Here, we show a scatterplot onto which we fit a regression line. The cloud near the origin and the one point in the bottom right represent the observed samples. The blue line is the regression fit when ignoring the point in the bottom right, while the pink line is the regression including that point. Evidently, this point in the bottom right has very high leverage; in fact, it reverses the sign of the association between \(X\) and \(Y\). This is also an example of how outliers (especially outliers in the \(X\) direction) can have very high leverage.
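These diagnostics are easy to compute from a fitted model. A sketch with simulated data, assuming statsmodels and matplotlib; the specific plotting choices are just one way of looking at the fit:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Hypothetical fit to illustrate the diagnostics discussed above.
rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2 * df["x"] + rng.normal(size=100)
fit = smf.ols("y ~ x", data=df).fit()

# Residuals vs fitted values: look for nonconstant variance or structure.
plt.scatter(fit.fittedvalues, fit.resid)

# QQ-plot of (standardized) residuals against the normal distribution:
# heavy tails show up as departures from the 45-degree line.
sm.qqplot(fit.resid, line="45", fit=True)

# Leverage (hat values): unusually large values deserve a closer look.
leverage = fit.get_influence().hat_matrix_diag
print(np.argsort(leverage)[-5:])  # indices of the five highest-leverage points
plt.show()
```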
Logistic regression is the analog of linear regression that can be used whenever the response \(Y\) is binary (e.g., patient got better, respondent answered “yes,” email was spam).
In linear regression, the responses \(y_i\) are used directly in a model of the form \(y_i = x^T_i\beta + \epsilon_i\).
In logistic regression, we instead want a model relating the \(x_i\) to the unknown probabilities of the two classes at \(x_i\): \(p(x_i)\) and \(1 - p(x_i)\). The observed value of \(y_i\) corresponding to \(x_i\) is modeled as being drawn from a coin flip with probability \(p(x_i)\). If we fit an ordinary linear regression model to the \(y_i\), we might get fitted responses \(\hat{y}_i\) outside the valid range \([0, 1]\), which in addition to being confusing is bad modeling.
Logistic regression instead models the log-odds transformation of the \(p(x_i)\). Concretely, it assumes the model:
\(p (y_1, y_2, \cdots, y_n) = \prod_{i=1}^n p_\beta(x_i)^{1_{y_i=1}} (1- p_\beta(x_i))^{1_{y_i=0}}\) where we are approximating \(p(x_i) \approx p_\beta(x_i) = \frac{1}{1+ e^{-x_i^T\beta}}\).
Figure 9: An example of the type of approximation that logistic regression makes. The x-axis represents the value of the feature, and the y-axis encodes the binary 0/1 response. The purple marks are the observed \((x_i, y_i)\) pairs. Note that class 1 becomes more common when \(x\) is large. The pink line represents the "true" underlying class probability as a function of \(x\), which we denote \(p(x)\). This doesn't lie in the logistic family \(\frac{1}{1+\exp(-x^T\beta)}\), but it can be approximated by a member of that family, which is drawn in blue (this is the logistic regression fit).
While you could study the differences \(y_i - \hat{p}(x_i)\), which would be analogous to linear regression residuals, it is usually more informative to study the Pearson or deviance residuals, which upweight small differences near the boundaries 0 and 1. These types of residuals take into account the structure of the assumed Bernoulli (coin-flipping) sampling model. For ways to evaluate the classification accuracy of logistic regression models, see the section on Prediction Evaluation, and for an overview of formal inference, see the section on Inference in GLMs.
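A minimal sketch of fitting a logistic regression and extracting Pearson and deviance residuals, using a binomial GLM in statsmodels on simulated data (all names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulate binary responses whose log-odds are linear in x.
rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.normal(size=300)})
p = 1 / (1 + np.exp(-(0.5 + 2.0 * df["x"])))  # p_beta(x_i)
df["y"] = rng.binomial(1, p)

# Fit logistic regression as a GLM with a binomial family and logit link.
fit = smf.glm("y ~ x", data=df, family=sm.families.Binomial()).fit()
print(fit.params)           # hat{beta} on the log-odds scale
print(fit.predict(df)[:5])  # fitted probabilities hat{p}(x_i), inside (0, 1)

# Pearson and deviance residuals, usually more informative than the raw
# differences y_i - hat{p}(x_i).
pearson = fit.resid_pearson
deviance = fit.resid_deviance
```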
Multinomial regression is a generalization of logistic regression for when the response can take on one of \(K\) categories (not just \(K = 2\)). Here, we want to study the way the probabilities for each of the \(K\) classes vary as \(x\) varies: \(p(y_i = k|x_i)\) for each \(k = 1, \cdots, K\). Think of the responses \(y_i\) as observations from a \(K\)-sided die, where different faces are more probable depending on the associated features \(x_i\). The approximation made by logistic regression is replaced with
\[p(y_i = k |x_i) \approx p_W( y_i=k|x_i )= \frac{\exp(w_k^Tx_i)}{\sum_{k'=1}^K \exp(w_{k'}^Tx_i)},\] where the parameters \(w_1, \cdots, w_K\) govern the relationship between \(x_i\) and the probabilities for the different classes. As is, this model is not identifiable (you can increase one of the \(w_k\)s and decrease another, and end up with exactly the same probabilities \(p_W(y_i = k|x_i)\)).
To resolve this, one of the classes (say the \(K\)th; this is usually chosen to be the most common class) is treated as a baseline, and we set \(w_K = 0\). Then the \(w_k\)s can be interpreted in terms of how a change in \(x_i\) changes the probability of observing class \(k\) relative to the baseline \(K\). That is, suppose we increase the \(j\)th variable by one unit, so that \(x_i \leftarrow x_i + \delta_j\). Then the odds of class \(k\) relative to the baseline class \(K\) get multiplied by \(\exp(w_{kj})\): \[\frac{p_W(y_i = k \mid x_i + \delta_j)}{p_W(y_i = K \mid x_i + \delta_j)} = \exp(w_{kj}) \, \frac{p_W(y_i = k \mid x_i)}{p_W(y_i = K \mid x_i)}.\]
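A sketch of fitting the softmax model above with statsmodels' `MNLogit`, which imposes the same kind of baseline constraint (coefficients are reported relative to a reference category); the simulated data and weights below are made up for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Simulated features and a 3-class response.
rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2))
W = np.array([[0.0, 1.0, -1.0],   # one column of weights per class;
              [0.0, -1.0, 1.0]])  # the zero column plays the role of the baseline
logits = X @ W
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y = np.array([rng.choice(3, p=pi) for pi in probs])

# MNLogit fits the softmax model; coefficients are reported relative to a
# reference category, matching the identifiability constraint described above.
fit = sm.MNLogit(y, sm.add_constant(X)).fit()
print(fit.params)  # one coefficient column per non-baseline class
```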
Sometimes we have \(K\) classes for the responses, but there is a natural ordering between them. For example, survey respondents might have chosen one of 6 values along a likert scale. Multinomial regression is unaware of this additional ordering structure – a reasonable alternative in this setting is ordinal regression.
Ordinal regression models typically posit a latent continuous score \(z_i = x_i^T\beta + \epsilon_i\) together with cutpoints \(\alpha_1 < \cdots < \alpha_{K-1}\); the observed response is \(y_i = k\) whenever \(z_i\) falls between \(\alpha_{k-1}\) and \(\alpha_k\). Here, the \(\alpha_k\)’s control the overall frequencies of the \(K\) classes, while \(\beta\) controls the influence of covariates on the response. Outside of the latent variable interpretation, it’s also possible to understand \(\beta\) in terms of relative risks. In particular, when we increase the \(j\)th coordinate by one unit, \(x_i \leftarrow x_i + \delta_j\), the odds of class \(k\) relative to class \(k-1\) get multiplied by \(\exp(\beta_j)\), for every pair of neighboring classes \(k-1\) and \(k\).
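Recent versions of statsmodels ship an `OrderedModel` that fits a cumulative-link version of this model; the sketch below assumes that class is available (packages such as `mord` are alternatives), and the simulated data and cutpoints are arbitrary.

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Simulated ordinal response: a latent linear score cut into 4 ordered bins.
rng = np.random.default_rng(6)
df = pd.DataFrame({"x": rng.normal(size=400)})
latent = 1.5 * df["x"] + rng.logistic(size=400)
df["y"] = pd.cut(latent, bins=[-np.inf, -1, 0, 1, np.inf])  # ordered categories

# Cumulative-logit fit: one slope for x plus K - 1 threshold (alpha) parameters.
fit = OrderedModel(df["y"], df[["x"]], distr="logit").fit(method="bfgs")
print(fit.params)
```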
Poisson regression is a type of generalized linear model that is often applied when the responses \(y_i\) are counts (i.e., \(y_i \in \{ 0, 1, 2, \cdots \}\)). As in logistic regression, one motivation for using this model is that applying ordinary linear regression to these responses might yield predictions that are impossible (e.g., numbers below zero, or numbers that are not integers). To see where the main idea for this model comes from, recall that the Poisson distribution with rate \(\lambda\) draws integers with probabilities:
\[\mathbb{P}_{\lambda} [y=k] = \frac{\lambda^k e^{-\lambda}}{k!}\]The idea of Poisson regression is to say that the different \(y_i\) are drawn from Poisson distributions with rates that depend on the associated \(x_i\).
More specifically, we assume that the data have a joint likelihood:
\[p(y_1, y_2, \cdots) = \prod_{i=1}^n \frac{\lambda(x_i)^{y_i} e^{-\lambda(x_i)}}{y_i!}\]and that the log of the rates are linear in the covariates:
\[\log(\lambda(x_i)) = x_i^T \beta\](modeling the logs as linear makes sure that the actual rates are always nonnegative, which is a requirement for the Poisson distribution).
We think of different regions of the covariate space as having higher or lower counts on average, depending on this rate function \(\lambda(x_i)\). Moving from \(x_i\) to \(x_i + \delta_j\) multiplies the rate from \(\lambda(x_i)\) to \(\exp(\beta_j) \lambda(x_i)\).
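A minimal sketch of a Poisson regression fit on simulated counts; exponentiating the coefficients recovers the multiplicative rate interpretation just described. The library and variable names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated counts with log-rate linear in x.
rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.normal(size=300)})
df["y"] = rng.poisson(np.exp(0.3 + 0.8 * df["x"]))

# GLM with Poisson family and (default) log link: log(lambda(x_i)) = x_i^T beta.
fit = smf.glm("y ~ x", data=df, family=sm.families.Poisson()).fit()
print(fit.params)          # hat{beta} on the log scale
print(np.exp(fit.params))  # multiplicative effect on the rate per unit change in x
```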
As in logistic regression, it makes more sense to consider the deviance residuals when performing diagnostics.
In particular, a deficiency of Poisson regression models (one that often motivates clients to come to consulting) is that real data often exhibit overdispersion with respect to the assumed Poisson model. The issue is that the mean and variance of counts in a Poisson model are tied together: if you sample from \(\text{Poisson}(\lambda)\), then the mean and variance of the associated counts are both \(\lambda\). In real data, the variance is often larger than the mean, so while the Poisson regression model might do a good job approximating the mean \(\lambda(x_i)\) at \(x_i\), the observed variance of the \(y_i\) near \(x_i\) might be much larger than \(\lambda(x_i)\). This motivates the methods in the following subsection.
Pseudo-Poisson and negative binomial regression are two common strategies for addressing overdispersion in count data. In the pseudo-Poisson setup, a new parameter \(\varphi\) is introduced that sets the relative scale of the variance in comparison to the mean: \(\mathbb{V}\text{ar}(y) = \varphi \mathbb{E}[y]\). This is not associated with any real probability distribution, and the associated likelihood is called a pseudolikelihood. However, \(\varphi\) can be optimized along with \(\beta\) from the usual Poisson regression setup to provide a maximum pseudolikelihood estimate. In negative binomial regression, the Poisson likelihood is abandoned altogether in favor of the negative binomial likelihood. Recall that the negative binomial (like the Poisson) is a distribution on nonnegative counts \(\{0, 1, 2, \cdots \}\). It has two parameters, \(p\) and \(r\), \[\mathbb{P}[y=k] ={ k+r-1 \choose k} p^k (1-p)^r,\] which can be interpreted as the distribution of the number of heads that appear before seeing \(r\) tails, when flipping a coin with probability \(p\) of heads. More important than the specific form of the distribution is the fact that it has two parameters, which allow different variances even for the same mean:
\[\mathbb{E}_{p,r}[y] = \frac{pr}{1-p}, \qquad \mathbb{V}\text{ar}_{p,r}[y] = \frac{pr}{(1-p)^2} = \mathbb{E}_{p,r}[y] + \frac{1}{r} \mathbb{E}_{p,r}[y]^2\]In particular, for small \(r\), the variance is much larger than the mean, while for large \(r\), the variance is about equal to the mean (the distribution reduces to the Poisson). For negative binomial regression, this likelihood is substituted for the Poisson when doing regression, and the mean is allowed to depend on covariates. On the other hand, while the variance is no longer forced to equal the mean, the dispersion parameter must be the same across all data points. This likelihood model is not exactly a GLM (the negative binomial with unknown \(r\) is not in the exponential family), but various methods for fitting it are available.
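Both strategies can be sketched with statsmodels: the pseudo- (quasi-) Poisson fit reuses the Poisson GLM but estimates the dispersion from the Pearson chi-square, while a negative binomial family handles the second approach. The dispersion value `alpha=1.0` below is an arbitrary illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated overdispersed counts: a gamma-distributed rate feeding a Poisson.
rng = np.random.default_rng(8)
df = pd.DataFrame({"x": rng.normal(size=500)})
mu = np.exp(0.3 + 0.8 * df["x"])
df["y"] = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=500))

# Pseudo- (quasi-) Poisson: same mean model, but the dispersion is estimated
# from the Pearson chi-square instead of being fixed at 1.
quasi = smf.glm("y ~ x", data=df, family=sm.families.Poisson()).fit(scale="X2")
print(quasi.scale)  # estimated dispersion; values > 1 indicate overdispersion

# Negative binomial regression with a fixed dispersion parameter alpha.
nb = smf.glm("y ~ x", data=df,
             family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb.params)
```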
There is a connection between the negative binomial and the Poisson that both illuminates potential sources of overdispersion and suggests new algorithms for fitting overdispersed data: the negative binomial can be interpreted as a "gamma mixture of Poissons."
More specifically, in the hierarchical model:
\[y \mid \lambda \sim \text{Poisson}(\lambda), \qquad \lambda \sim \text{Gamma}\left(r, \frac{p}{1-p}\right)\]which draws a Poisson with a randomly chosen mean parameter, the marginal distribution of \(y\) is exactly \(\text{NegBin}(r, p)\). This suggests that one reason overdispersed data might arise is that they are actually a mixture of true Poisson subpopulations, each with a different mean parameter. This is also the starting point for Bayesian inference of overdispersed counts, which fits Poisson and Gamma distributions according to this hierarchical model.
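This equivalence is easy to check by simulation; a short sketch with arbitrary values of \(r\) and \(p\):

```python
import numpy as np

rng = np.random.default_rng(9)
r, p = 3.0, 0.4
n = 200_000

# Hierarchical draw: lambda ~ Gamma(r, scale = p / (1 - p)), then y | lambda ~ Poisson(lambda).
lam = rng.gamma(shape=r, scale=p / (1 - p), size=n)
y = rng.poisson(lam)

# The marginal of y should match NegBin(r, p): mean pr/(1-p), variance pr/(1-p)^2.
print(y.mean(), r * p / (1 - p))       # both close to 2.0
print(y.var(), r * p / (1 - p) ** 2)   # both close to 3.33
```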