Chapter 10: Multiple Linear Regression

Sections 10.4-10.8 - Terminology, Inference & Extensions

1 Sections 10.4-10.8: Terminology, Inference & Extensions


1.1 10.4 Multiple linear regression terminology

Multiple regression with two explanatory variables (\(x_1\) and \(x_2\)) allows for assessing the differences in expected \(y\) across observations that differ in \(x_1\) but are similar in terms of \(x_2\). This difference is called conditional on that other explanatory variable \(x_2\): difference in \(y\) by \(x_1\), conditional on \(x_2\). It is also called the controlled difference: difference in \(y\) by \(x_1\), controlling for \(x_2\). We often say that we condition on \(x_2\), or control for \(x_2\), when we include it in a multiple regression that focuses on average differences in \(y\) by \(x_1\). When we focus on \(x_1\) in the multiple regression, the other right-hand-side variable, \(x_2\), is called a covariate. In some cases, it is also called a confounder: if omitting \(x_2\) makes the slope on \(x_1\) different, it is said to confound the association of \(y\) and \(x_1\) (we’ll discuss confounders in Chapter 19, Section 19.3).
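To make the conditioning idea concrete, here is a minimal simulated-data sketch in R (all names are made up for illustration, not from the chapter’s case study data): because \(x_2\) is correlated with \(x_1\) and also related to \(y\), the slope on \(x_1\) changes once we condition on \(x_2\).

# Conditioning vs. omitting a confounder, illustrated with simulated data
set.seed(10)
n  <- 1000
x2 <- rnorm(n)
x1 <- 0.7 * x2 + rnorm(n)            # x1 and x2 move together
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n) # x2 also matters for y

coef(lm(y ~ x1))        # slope on x1 without conditioning on x2 (picks up part of x2's role)
coef(lm(y ~ x1 + x2))   # slope on x1 conditional on x2 (close to 2)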

Multiple Linear Regression Terminology

🤖 AI PRACTICE TASK #5

Prompt: “Explain the difference between ‘conditioning on’ a variable and ‘confounding’ in regression analysis. Give a concrete example where omitting a confounder would lead to wrong conclusions.”

📋 COPY & OPEN IN CHATGPT


1.2 10.5 Standard errors and confidence intervals in multiple linear regression

The concept of statistical inference and the interpretation of confidence intervals in multiple regressions is similar to that in simple regressions. For example, the 95% confidence interval of slope of \(x_1\) in a multiple linear regression that conditions on another explanatory variable \(x_2\) (CI of \(\beta_1\) in \(y^E= \beta_0+\beta_1 x_1+\beta_2 x_2\)) shows where we can expect the coefficient in the population, or general pattern, represented by the data.

Similarly to the coefficients in the simple regression, the 95% CI of a slope in a multiple regression is the coefficient value estimated from the data plus-or-minus two standard errors. Again similarly to the simple regression case, we can get the standard error either by bootstrap or using an appropriate formula. And, as usual, the simple SE formula is not a good approximation in general: it assumes homoskedasticity (the same fit of the regression over the range of the explanatory variables). There is a robust SE formula for multiple regression, too, that works in general, both under homoskedasticity and heteroskedasticity. Thus, just as with simple regressions, we advise you to have the software calculate robust SEs by default.

While not correct in general, the simple formula is worth examining because it shows what makes the SE larger in a simpler, more intuitive way than the robust formula. The simple SE formula for the slope \(\hat{\beta}_{1}\) is

\[ SE(\hat{\beta}_{1}) = \frac{Std[e]}{\sqrt{n}Std(x_1) \sqrt{1-R_{1}^2}} \]

Similarly to the simple SE formula for the simple linear regression in Chapter 9, Section 9.3, this formula has \(\sqrt{n}\) in its denominator. But, similarly again to the simple linear regression, the correct number to divide by would be slightly different: the degrees of freedom instead of the number of observations (see Chapter 9, Section 9.3.3). Here that would be \(\sqrt{n-k-1}\), where \(k\) is the number of right-hand-side variables in the regression. Similarly to the simple regression, this makes little practical difference in most cases. However, in contrast with the simple regression case, it may make a difference not only when we have too few observations, but also when we have many right-hand-side variables relative to the number of observations. We’ll ignore that issue for most of this textbook, but it will come back sometimes, as, for example, in Chapter 21, Section 21.3.

This formula is very similar to what we have for simple regressions in other details, too, except for the new \(\sqrt{1-R_{1}^2}\) term in the denominator. \(R_{1}^2\) is the R-squared of the regression of \(x_1\) on \(x_2\). Recall that the R-squared of a simple regression is the square of the correlation between the two variables in the regression. Thus, \(R_{1}^2\) is the square of the correlation between \(x_1\) and \(x_2\). The stronger this correlation, the larger \(R_{1}^2\), the smaller \(\sqrt{1-R_{1}^2}\), and thus the larger \(1/\sqrt{1-R_{1}^2}\) (recall that \(\sqrt{1-R_{1}^2}\) is in the denominator). So, the stronger the correlation between \(x_1\) and \(x_2\), the larger the SE of \(\hat{\beta}_{1}\). Note the symmetry: the same applies to the SE of \(\hat{\beta}_{2}\). As for the familiar terms in the formula: the SE is smaller, the smaller the standard deviation of the residuals (the better the fit of the regression), the larger the sample, and the larger the standard deviation of \(x_{1}\).
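The formula can also be checked numerically. Below is a minimal sketch on simulated data (not one of the textbook datasets) that computes each ingredient and compares the result to the SE reported by lm(); the two agree only approximately because lm() applies a degrees-of-freedom correction.

# Recompute the simple (homoskedasticity-only) SE formula by hand on simulated data
set.seed(10)
n  <- 2000
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit  <- lm(y ~ x1 + x2)
e    <- resid(fit)                              # regression residuals
R1sq <- summary(lm(x1 ~ x2))$r.squared          # R-squared of x1 on x2

se_formula <- sd(e) / (sqrt(n) * sd(x1) * sqrt(1 - R1sq))
se_lm      <- summary(fit)$coefficients["x1", "Std. Error"]
c(simple_formula = se_formula, lm_output = se_lm)  # close but not identical (df correction)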

In the polar case of a correlation of one (or negative one), which corresponds to \(R^2_1 = 1\), the SE of the two coefficients does not exist. A correlation of one means that \(x_1\) and \(x_2\) are linear functions of each other. It is not only the SE formulae that cannot be computed in this case; the regression coefficients cannot be computed either. In this case the explanatory variables are said to be perfectly collinear.
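A quick way to see what perfect collinearity does in practice: if one right-hand-side variable is an exact linear function of another, R’s lm() cannot compute a separate coefficient for it and reports NA (a tiny sketch with made-up variables).

# Perfect collinearity: x2 is an exact linear function of x1
set.seed(10)
x1 <- rnorm(100)
x2 <- 3 * x1 + 5                 # exact linear function of x1
y  <- 1 + 2 * x1 + rnorm(100)
coef(lm(y ~ x1 + x2))            # coefficient on x2 is reported as NA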

Strong but imperfect correlation between explanatory variables is called multicollinearity. It still allows for calculating the slope coefficients and their standard errors, but the standard errors may be large. Intuitively, this is because we would like to compare observations that differ in one of the variables but are similar in the other. Strong correlation between the two implies that there are few observations that are similar in one variable but different in the other, so there are just not enough valid observations for comparing average \(y\) across them. Indeed, the problem of multicollinearity is very similar to the problem of having too few observations in general. We can see that in the formula as well: the roles of \((1-R_{1}^2)\) and \(n\) are analogous.

Consider our example of estimating how sales of the main product of our company tend to change when our price changes but the prices of competitors do not. In that example our own price and the competitors’ prices tended to move together. That’s multicollinearity. One consequence of this is that omitting the change in the competitors’ price would lead to omitted variable bias; thus we need to include that in our regression. But here we see that multicollinearity has another consequence. Including both price variables in the regression makes the SE of the coefficient of our own price larger, and its confidence interval wider, too. Intuitively, that’s because there are few months when our price changes but the competitors’ prices don’t change, and it is changes in sales in those months that contain the valuable information for estimating the coefficient on our own price. Months when our own and competitors’ prices change the same way don’t help. So the reason why we want competitors’ price in our regression (strong co-movement) is exactly the reason for having imprecise estimates with wide confidence intervals.

That’s true in general, too. Unfortunately, there is not much we can do about multicollinearity in the data we have, just as there is not much we can do about having too few observations. More data helps both, of course, but that is not much help when we have to work with the data that’s available. Alternatively, we may decide to change the specification of the regression and drop one of the strongly correlated explanatory variables. However, that results in a different regression. Whether we want a different regression or not needs to be evaluated keeping the substantive question of the analysis in mind.
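A small simulation sketch of the core point (illustrative names, not the price data): holding the sample size and the true coefficients fixed, the SE of the slope on \(x_1\) grows as the correlation between \(x_1\) and \(x_2\) approaches one.

# How SE(beta1-hat) grows with the correlation between x1 and x2
set.seed(10)
n <- 1000
se_by_rho <- sapply(c(0, 0.5, 0.9, 0.99), function(rho) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)   # correlation with x1 is roughly rho
  y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
})
round(se_by_rho, 3)   # SEs increase sharply as the correlation approaches 1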

Inference in Multiple Regression

\[ SE(\hat{\beta}_{1})=\frac{Std[e]}{\sqrt{n}Std(x_1) \sqrt{1-R_{1}^2}} \]

where \(e\) is the residual, \(e = y - (\hat{\beta}_{0}+\hat{\beta}_{1} x_1 + \hat{\beta}_{2} x_2)\), and \(R_{1}^2\) is the R-squared in the simple linear regression of \(x_1\) on \(x_2\).

🤖 AI PRACTICE TASK #6

Prompt: “I have two potential explanatory variables that are highly correlated (r = 0.95). Should I include both in my regression? Explain the trade-off between omitted variable bias and multicollinearity.”

📋 COPY & OPEN IN CHATGPT


1.3 10.6 Hypothesis testing in multiple linear regression

Testing hypotheses about coefficients in a multiple regression is also very similar to that in a simple regression. The standard errors are estimated in a different way, but with the appropriate SE everything works just the same. For example, to test \(H_0: \beta_1 = 0\) against \(H_A: \beta_1 \ne 0\), we need the p-value or the t-statistic. Standard regression output produced by most statistical software shows those statistics. If our level of significance is 0.05, we reject \(H_0\) if the p-value is less than 0.05 or, which is the same information in a different form, if the t-statistic is less than -2 or greater than +2.
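In R, one way to get these statistics with robust standard errors is estimatr’s lm_robust(), which we also use in the replication code later in this section; the data and variable names below are placeholders.

# Robust SEs, t-statistics, and p-values for each coefficient (placeholder names)
library(estimatr)
fit <- lm_robust(y ~ x1 + x2, data = mydata)   # 'mydata', 'y', 'x1', 'x2' are hypothetical
summary(fit)   # reports estimate, robust SE, t value, p-value, and 95% CI for each coefficient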

Besides testing a hypothesis that involves a single coefficient, we sometimes test a hypothesis that involves more coefficients. As we explained in Chapter 9, Section 9.4, these come in two forms: a single null hypothesis about two or more coefficients (e.g., if they are equal), or a list of null hypotheses (e.g., that several slope coefficients are zero). The latter is called testing joint hypotheses.

Testing joint hypotheses is based on a test statistic called the F-statistic, and the related test is called the F-test. The underlying logic of hypothesis testing is the same here: reject the null if the test statistic is larger than a critical value, which indicates that the estimated coefficients are too far from what’s in the null. The technical details are different, but the meaning of the p-value is the same as always. Thus, we advise getting the p-value when testing a joint hypothesis.

In fact, the test that asks whether all slope coefficients in the regression are zero has its own name: the global F-test, or simply “the” F-test. Its results are often shown by statistical software by default. More frequently, though, we test joint hypotheses to decide whether a subset of the coefficients (such as the coefficients on all geographical variables) are all zero.

Similarly to testing hypotheses about single coefficients, the F-test needs appropriate standard error estimates. In cross-sectional data, those appropriate estimates are usually the robust SE estimates.
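One common way to run such a joint test in R is car::linearHypothesis() on an lm() fit, supplying a heteroskedasticity-robust covariance matrix from the sandwich package; the data and variable names below are placeholders.

# Joint F-test that the slopes on x2 and x3 are both zero, using robust SEs
library(car)        # linearHypothesis()
library(sandwich)   # vcovHC(): heteroskedasticity-robust covariance matrix
fit <- lm(y ~ x1 + x2 + x3, data = mydata)     # hypothetical data and variables
linearHypothesis(fit, c("x2 = 0", "x3 = 0"),
                 vcov. = vcovHC(fit, type = "HC1"))   # robust F-statistic and p-value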

Hypothesis Testing in Multiple Regression

Single Coefficient Tests (t-tests):
- Test \(H_0: \beta_j = 0\) vs. \(H_A: \beta_j \ne 0\)
- Use the t-statistic or p-value from the regression output
- Reject \(H_0\) if the p-value < 0.05 (or if |t| > 2 as a rough guide)

Joint Hypothesis Tests (F-tests):
- Test whether multiple coefficients are zero simultaneously
- Example: \(H_0: \beta_1 = \beta_2 = \beta_3 = 0\)
- Use the F-statistic and its p-value
- Global F-test: tests whether ALL slope coefficients are zero

Critical: Always use robust standard errors for valid inference!

🤖 AI PRACTICE TASK #7

Prompt: “What’s the difference between testing whether β₁ = 0 and testing whether β₁ = β₂ = β₃ = 0? Why can’t I just do three separate t-tests for the second question?”

📋 COPY & OPEN IN CHATGPT


1.4 Case Study A2: Understanding the Gender Difference in Earnings

CASE STUDY A2: STATISTICAL INFERENCE

Statistical inference

Let’s revisit the results in Table 10.1, taking statistical inference into account. The data represents employees with a graduate degree in the U.S.A. in 2014. According to the estimate in column (1), women in this sample earn 19.5 percent less than men, on average. The appropriately estimated (robust) standard error is 0.008, implying a 95% CI of approximately [-0.21,-0.18]. We can be 95% confident that women earned 18 to 21 percent less, on average, than men among employees with graduate degrees in the U.S.A. in 2014.

Column (2) suggests that when we compare employees of the same age, women in this sample earn approximately 18.5 percent less than men, on average. The 95% CI is approximately [-0.20,-0.17]. It turns out that the estimated -0.195 in column (1) is within this CI, and the two CIs overlap. Thus it is very possible that there is no difference between these two coefficients in the population. We uncovered a difference in the data between the unconditional gender wage gap and the gender gap conditional on age. However, that difference is small. Moreover, it may not exist in the population. These two facts tend to go together: small differences are harder to pin down in the population, or general pattern, represented by the data. Often, that’s all right. Small differences are rarely very important. When they are, we need more precise estimates, which may come with larger sample size.
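As a quick arithmetic check, the approximate 95% CI is just the estimate plus or minus two robust standard errors; here with the conditional estimate of -0.185 and its SE of 0.008 (see column (2) of Table 10.2 below).

# Approximate 95% CI: estimate +/- 2 x robust SE
estimate <- -0.185
se       <-  0.008
round(c(lower = estimate - 2 * se, upper = estimate + 2 * se), 3)   # about -0.201 and -0.169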

Understanding Confidence Intervals

The 95% CI of [-0.20, -0.17] for the conditional gender gap means:
- We are 95% confident the true population difference is between -20% and -17%
- This interval does NOT include zero, so we can be confident there is a real gender gap
- The unconditional estimate of -19.5% falls within this interval
- While we see a difference in point estimates (19.5% vs 18.5%), the CIs overlap, suggesting the difference may not be statistically significant

🤖 AI PRACTICE TASK #8

Prompt: “If two confidence intervals overlap, does that mean the coefficients are not significantly different? Explain why overlapping CIs are related to but not the same as a formal test of coefficient equality.”

📋 COPY & OPEN IN CHATGPT


1.5 10.7 Multiple linear regression with three or more explanatory variables

We spent a lot of time on multiple regression with two right-hand-side variables. That’s because that regression shows all the important differences between simple regression and multiple regression in intuitive ways. In practice, however, we rarely estimate regressions with exactly two right-hand-side variables. The number of right-hand-side variables in a multiple regression varies from case to case, but it’s typically more than two. In this section we describe multiple regressions with three or more right-hand-side variables. Their general form is

\[ y^E = \beta_0+\beta_1 x_1+\beta_2 x_2 +\beta_3 x_3+... \]

All of the results, language, and interpretations discussed so far carry forward to multiple linear regressions with three or more explanatory variables. Interpreting the slope of \(x_1\): on average, \(y\) is \(\beta_{1}\) units larger in the data for observations with one unit larger \(x_1\) but with the same value for all other \(x\) variables. The interpretation of the other slope coefficients is analogous. The language of multiple regression is the same, including the concepts of conditioning, controlling, omitted, or confounder variables.

The standard error of the coefficients may be estimated by bootstrap or by a formula. As always, the appropriate formula is the robust SE formula. But the simple formula shows, in an intuitive way, what makes even the robust SE larger or smaller. For any slope coefficient \(\hat{\beta}_{k}\), the simple SE formula is

\[ SE(\hat{\beta}_{k})=\frac{Std[e]}{\sqrt{n}Std[x_k]\sqrt{1-R_{k}^2}} \]

Almost everything is the same as with two right-hand-side variables. In particular, the SE is smaller, the smaller the standard deviation of the residuals (the better the fit of the regression), the larger the sample, and the larger the standard deviation of \(x_k\). The new-looking term is \(R_{k}^2\). But that’s simply the generalization of \(R_{1}^2\) in the previous formula: it is the R-squared of the regression of \(x_k\) on all other \(x\) variables. The smaller that R-squared, the smaller the SE.
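The sketch below (simulated data, illustrative names) computes \(R_{k}^2\) directly by regressing one explanatory variable on the others; the factor \(1/\sqrt{1-R_{k}^2}\) is how much this inflates the SE, which is the idea behind the variance inflation factor.

# R_k^2: how well x3 is explained by the other right-hand-side variables (simulated data)
set.seed(10)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.8 * x1 + 0.5 * x2 + rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

Rk2 <- summary(lm(x3 ~ x1 + x2))$r.squared      # R-squared of x3 on the other x variables
c(Rk2 = Rk2, SE_inflation = 1 / sqrt(1 - Rk2))  # multiplier on SE(beta3-hat) relative to Rk2 = 0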

Multiple Linear Regression with Three or More Variables

Equation: \(y^E = \beta_0+\beta_1 x_1+\beta_2 x_2 +\beta_3 x_3+...\)

Interpretation of \(\beta_{k}\) (slope of \(x_k\)):
- On average, \(y\) is \(\beta_{k}\) units larger in the data for observations with one unit larger \(x_k\) but with the same value for all other \(x\) variables.

Standard Error: \[ SE(\hat{\beta}_{k})=\frac{Std[e]}{\sqrt{n}Std[x_k]\sqrt{1-R_{k}^2}} \]

where \(e\) is the regression residual and \(R_{k}^2\) is the R-squared of the regression of \(x_k\) on all other \(x\) variables.

🤖 AI PRACTICE TASK #9

Prompt: “I want to add a 5th explanatory variable to my regression. What should I consider before doing so? Discuss degrees of freedom, multicollinearity, and interpretation.”

📋 COPY & OPEN IN CHATGPT


1.6 10.8 Nonlinear patterns and multiple linear regression

In Chapter 8 we introduced piecewise linear splines, quadratics, and other polynomials to approximate a nonlinear \(y^E=f(x)\) regression.

From a substantive point of view, piecewise linear splines and polynomials of a single explanatory variable are not multiple regressions. They do not uncover differences with respect to one right-hand-side variable conditional on one or more other right-hand-side variables. Their slope coefficients cannot be interpreted as the coefficients of multiple regressions: it does not make sense to compare observations that have the same \(x\) but a different \(x^2\).

But such regressions are multiple linear regressions from a technical point of view. This means that their coefficients are calculated in the exact same way as the coefficients of multiple linear regressions. Their standard errors are calculated the same way, too, and so are their confidence intervals, test statistics, and p-values.

Testing hypotheses can be especially useful here, as it can help choose the functional form. With a piecewise linear spline, we can test whether the slopes are the same in adjacent line segments. If we can’t reject the null that they are the same, we may as well join them instead of having separate line segments. Testing hypotheses helps in choosing a polynomial, too. Here an additional complication is that the coefficients don’t have an easy interpretation in themselves. However, testing if all nonlinear coefficients are zero may help decide whether to include them at all.
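For example, to decide whether the quadratic and higher-order age terms belong in the earnings regression of the case study below, one could jointly test that their coefficients are all zero. A hedged sketch using lmtest::waldtest() with a robust covariance matrix, assuming the same data frame and variable names as the replication code later in this section:

# Joint test: are all nonlinear age terms (age^2, age^3, age^4) zero?
library(lmtest)     # waldtest()
library(sandwich)   # vcovHC(): heteroskedasticity-robust covariance matrix
fit_lin  <- lm(lnearnings ~ female + age, data = data)
fit_poly <- lm(lnearnings ~ female + age + I(age^2) + I(age^3) + I(age^4), data = data)
waldtest(fit_lin, fit_poly, vcov = vcovHC(fit_poly, type = "HC1"))   # robust F-test and p-value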

However, testing hypotheses to decide whether to include a higher-order polynomial has its issues. Recall that a multiple linear regression requires that the right-hand-side variables are not perfectly collinear. In other words, they cannot be linear functions of each other. With a polynomial on the right-hand side, those variables are exact functions of each other: \(x^2\) is the square of \(x\). But they are not a linear function of each other, so, technically, they are not perfectly collinear. That’s why we can include both \(x\) and \(x^2\) and, if needed, its higher order terms, in a linear regression. While they are not perfectly collinear, explanatory variables in a polynomial are often highly correlated. That multicollinearity results in high standard errors, wide confidence intervals, and high p-values. As with all kinds of multicollinearity, there isn’t anything we can do about that once we have settled on a functional form.

Importantly, when thinking about functional form, we should always keep in mind the substantive focus of our analysis. As we emphasized in Chapter 8, Section 8.3, we should go back to that original focus when deciding whether we want to include a piecewise linear spline or a polynomial to approximate a nonlinear pattern. There we said that we want our regression to have a good approximation to a nonlinear pattern in \(x\) if our goal is prediction or analyzing residuals. We may not want that if all we care about is the average association between \(x\) and \(y\), except if that nonlinearity messes up the average association. This last point is a bit subtle, but usually means that we may want to transform variables to relative changes or take logs if the distribution of \(x\) or \(y\) is very skewed.

Here we have multiple \(x\) variables. Should we care about whether each is related to average \(y\) in a nonlinear fashion? The answer is the same as earlier: yes, if we want to do prediction or analyze residuals; no, if we care about average associations (except we may want to have transformed variables here, too). In addition, when we focus on a single average association (with, say, \(x_1\)) and all the other variables (\(x_2\), \(x_3\), …) are covariates to condition on, the only thing that matters is the coefficient on \(x_1\). Even if nonlinearities matter for \(x_2\) and \(x_3\) themselves, they only matter for us if they make a difference in the estimated coefficient on \(x_1\). Sometimes they do; very often they don’t.

When to Care About Nonlinear Patterns

Care about functional form when:
- Your goal is prediction → Need accurate \(\hat{y}\)
- You’re analyzing residuals → Need good overall fit
- The distribution of \(x\) or \(y\) is highly skewed → Consider transformations (logs, etc.)

Don’t worry as much when:
- You care only about average associations → Linear may be “good enough”
- Nonlinearities in covariates don’t affect your coefficient of interest

Rule of thumb: If including nonlinear terms doesn’t meaningfully change your main coefficient, the linear specification may be adequate for your purpose.


1.7 Case Study A3: Understanding the Gender Difference in Earnings

CASE STUDY A3: NONLINEAR PATTERNS

Nonlinear patterns and multiple linear regression

This step in our case study illustrates the point we made in the previous section. The regressions in Table 10.1 enter age in linear ways. Using part of the same data, in Chapter 9, Section 9.2 we found that log earnings and age follow a nonlinear pattern. In particular, there we found that average log earnings are a positive and steep function of age for younger people, but the pattern becomes gradually flatter for the middle-aged and may become completely flat, or even negative, among older employees.

Should we worry about the non-linear age-earnings pattern when our question is the average earnings difference between men and women? We investigated the gender gap conditional on age. Table 10.2 shows the results for multiple ways of doing it. Column (1) shows the regression with the unconditional difference that we showed in Table 10.1, for reference. Column (2) enters age in linear form. Column (3) enters it as quadratic. Column (4) enters it as a fourth-order polynomial.

📊 REPLICATE TABLE 10.2
# R Code to replicate Table 10.2
library(estimatr)
library(modelsummary)

# Load data
data <- read.csv("cps_earnings_grad.csv")

# Estimate the four specifications; lm_robust() reports heteroskedasticity-robust SEs by default
model1 <- lm_robust(lnearnings ~ female, data = data)
model2 <- lm_robust(lnearnings ~ female + age, data = data)
model3 <- lm_robust(lnearnings ~ female + age + I(age^2), data = data)
model4 <- lm_robust(lnearnings ~ female + age + I(age^2) + I(age^3) + I(age^4),
                    data = data)

# Display results
modelsummary(list(model1, model2, model3, model4),
             stars = c('***' = 0.01, '**' = 0.05, '*' = 0.1),
             gof_omit = "IC|Log|F|RMSE")

🚀 OPEN IN CODESPACE

Table 10.2: Gender differences in earnings – log earnings and age, various functional forms

| Variable     | (1) ln w   | (2) ln w   | (3) ln w   | (4) ln w     |
|--------------|------------|------------|------------|--------------|
| female       | -0.195***  | -0.185***  | -0.180***  | -0.180***    |
|              | (0.008)    | (0.008)    | (0.008)    | (0.008)      |
| age          |            | 0.007***   | -0.024**   | -0.168***    |
|              |            | (0.000)    | (0.010)    | (0.051)      |
| age²         |            |            | 0.0003***  | 0.008***     |
|              |            |            | (0.0001)   | (0.003)      |
| age³         |            |            |            | -0.0001***   |
|              |            |            |            | (0.00003)    |
| age⁴         |            |            |            | 0.000001**   |
|              |            |            |            | (0.0000003)  |
| Constant     | 3.514***   | 3.198***   | 3.716***   | 6.924***     |
|              | (0.006)    | (0.018)    | (0.224)    | (1.088)      |
| Observations | 18,241     | 18,241     | 18,241     | 18,241       |
| R-squared    | 0.028      | 0.046      | 0.048      | 0.050        |

Note: All employees with a graduate degree. Robust standard error estimates in parentheses. *** p<0.01, ** p<0.05, * p<0.1
Source: cps-earnings dataset. 2014, U.S.A.

The unconditional difference is -19.5%; the conditional difference is -18.5% according to column (2), and -18% according to columns (3) and (4). The various estimates of the conditional difference are very close, and each falls within the confidence intervals of the others. Thus, apparently, the functional form for age does not really matter if we are interested in the average gender gap.

At the same time, all coefficient estimates of the higher-order polynomial terms are statistically significant, meaning that the nonlinear pattern is very likely true in the population and not just a chance event in the particular dataset. The R-squared of the more complicated regressions is also larger. These results indicate that the complicated polynomial specifications are better at capturing the patterns. That would certainly matter if our goal were to predict earnings. But it does not matter for uncovering the gender difference in average earnings.

Key Insight: Purpose Determines Functional Form

What we learned:
- Adding nonlinear age terms (quadratic, quartic) increases R² from 0.046 to 0.050
- All polynomial coefficients are statistically significant
- BUT the gender coefficient barely changes: -0.185 → -0.180 → -0.180

Conclusion: If your question is “what is the gender gap controlling for age?”, the linear specification is adequate. The nonlinear terms matter for prediction but not for estimating the gender gap.

General principle: Match your functional form to your analytical purpose!

🤖 AI PRACTICE TASK #10

Prompt: “I’m estimating the effect of education on wages, controlling for age. Should I include age as a polynomial? Walk me through how to decide, considering my research question.”

📋 COPY & OPEN IN CHATGPT


🔗 Explore Further

Interactive Dashboard: Visualize confidence intervals and hypothesis tests
Open Dashboard

Run the Analysis: Replicate all regressions yourself
Open in GitHub Codespace

Download the Data: cps-earnings dataset
Get Data