Sections 10.1-10.3 - Foundation
Why and how to carry out multiple linear regression analysis, and how to interpret its results
Motivation
There is a substantial difference in the average earnings of women and men in all countries. You want to understand more about the potential origins of that difference, focusing on employees with a graduate degree in your country. You have data on a large sample of employees with a graduate degree, with their earnings and some of their characteristics, such as age and the kind of graduate degree they have. Women and men differ in those characteristics, which may affect their earnings. How should you use this data to uncover gender differences that are not due to differences in those other characteristics? And can you use regression analysis to uncover patterns of association between earnings and those other characteristics that may help you understand the origins of gender differences in earnings?
You have analyzed your data on hotel prices in a particular city to find hotels that are underpriced relative to how close they are to the city center. But you have also uncovered differences in terms of other features of the hotels that measure quality and are related to price. How would you use this data to find hotels that are underpriced relative to all of their features? And how can you visualize the distribution of hotel prices relative to what price you would expect for their features in a way that helps identify underpriced hotels?
After understanding simple linear regression, we can turn to multiple linear regression, which has more than one explanatory variable. Multiple linear regression is the most widely used method to uncover patterns of association between variables. There are multiple reasons to include more explanatory variables in a regression. We may be interested in uncovering patterns of association between y and other explanatory variables, which may help us understand differences in terms of the x variable we are most interested in. Or, we may be interested in the effect of an x variable, but we want to compare observations that are different in x but similar in other variables. Finally, we may want to predict y, and we want to use more x variables to arrive at better predictions.
We discuss why and when we should estimate multiple regression, how to interpret its coefficients, and how to construct and interpret confidence intervals and tests of the coefficients. We discuss the relationship between multiple regression and simple regression. We explain that piecewise linear splines and polynomial regressions are technically multiple linear regressions, although their coefficients do not have the same interpretation. We discuss how to include categorical explanatory variables as well as interactions that help uncover different slopes for groups. We include an informal discussion of how to decide which explanatory variables to include and in what functional form. Finally, we discuss why a typical multiple regression with observational data can get us closer to a causal interpretation without fully uncovering it.
The first case study in this chapter, Understanding the gender difference in earnings, uses the cps-earnings data to illustrate the use of multiple regression to understand potential sources of gender differences in earnings. We also go back to our question on finding underpriced hotels relative to their location and quality in the case study Finding a good deal among hotels with multiple regression, using the hotels-vienna data, to illustrate the use of multiple regression in prediction and residual analysis.
Learning outcomes. After working through this chapter, you should be able to
🎯 Interactive Enhancements
This HTML version includes:
- 🔗 Dashboard links to explore concepts visually
- 💻 Codespace links to run analyses yourself
- 🤖 AI practice tasks for personalized learning
- 📊 Interactive figures and tables
There are three broad reasons to carry out multiple regression analysis instead of simple regression. The first is exploratory data analysis: we may want to uncover more patterns of association, typically to generate questions for subsequent analysis. The other two reasons are related to one of the two ultimate aims of data analysis: making a better prediction by explaining more of the variation, and getting closer to establishing cause and effect in observational data by comparing observations that are more comparable.
The first example of the introduction is about understanding the reasons for a difference. It’s a causal question of sorts: we are interested in what causes women to earn less than men. The second example is one of prediction: we want to capture the average price associated with hotel features that customers value, in order to identify hotels that are inexpensive compared to what their price “should be.”
Prompt: “I’m studying the relationship between education and earnings. What are three variables I should ‘control for’ in a multiple regression, and why would omitting them bias my results?”
Multiple regression analysis uncovers average y as a function of more than one x variable: yE = f(x1, x2, ...). It can lead to better predictions of ŷ by considering more explanatory variables. It may improve the interpretation of slope coefficients by comparing observations that are different in terms of one of the x variables but similar in terms of all other x variables.
Multiple linear regression specifies a linear function of the explanatory variables for the average y. Let’s start with the simplest version with two explanatory variables:
yE = β0 + β1x1 + β2x2
This is the standard notation for a multiple regression: all coefficients are denoted by the Greek letter β, but they have subscripts. The intercept has subscript 0, the first explanatory variable has subscript 1, and the second explanatory variable has subscript 2.
Having another right-hand-side variable in the regression means that we take the value of that other variable into account when we compare observations. The slope coefficient on x1 shows the difference in average y across observations with different values of x1 but with the same value of x2. Symmetrically, the slope coefficient on x2 shows the difference in average y across observations with different values of x2 but with the same value of x1. This way, multiple regression with two explanatory variables compares observations that are similar in one explanatory variable to see the differences related to the other explanatory variable.
The interpretation of the slope coefficients takes this into account. β1 shows how much larger y is on average for observations in the data with one unit larger value of x1 but the same value of x2. β2 shows how much larger y is on average for observations in the data with one unit larger value of x2 but with the same value of x1.
Multiple linear regression with two explanatory variables
yE = β0 + β1x1 + β2x2
Interpretation of the coefficients
- β1 shows how much larger y is, on average, for observations with a one unit larger value of x1 but the same value of x2.
- β2 shows how much larger y is, on average, for observations with a one unit larger value of x2 but the same value of x1.
Prompt: “I ran a regression: Price = 50 + 10×Bedrooms + 5×SquareFeet. How would I interpret the coefficient on Bedrooms? Write it in plain English as if explaining to a non-statistician.”
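To make the interpretation concrete, here is a minimal simulated sketch (hypothetical data, not from the textbook) that fits a regression with two explanatory variables and reads off the coefficient on x1 as the average difference in y across observations that differ by one unit in x1 but share the same value of x2:

```r
# Simulate data in which y depends on both x1 and x2
set.seed(42)
n  <- 5000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + 0.5 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
coef(fit)
# The coefficient on x1 (about 1.5) is the average difference in y between
# observations one unit apart in x1 but with the same value of x2;
# the coefficient on x2 (about 0.5) is interpreted symmetrically.

# Equivalent check with predictions: change x1 by one unit, hold x2 fixed
predict(fit, newdata = data.frame(x1 = 1, x2 = 0)) -
  predict(fit, newdata = data.frame(x1 = 0, x2 = 0))
```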
It is instructive to examine the difference in the slope coefficient of an explanatory variable x1 when it is the only variable on the right-hand-side of a regression compared to when another explanatory variable, x2, is included as well. In notation, the question is the difference between β in the simple linear regression
yE = α + βx1
and β1 in the multiple linear regression
yE = β0 + β1x1 + β2x2
For example, we may use time series data on sales and prices of the main product of our company, and regress month-to-month change in log quantity sold (y) on month-to-month change in log price (x1). Then β shows the average percentage change in sales when our price increases by one percent. An estimate of β̂ = −0.5, say, would show that sales tend to decrease by half of a percent when our price increases by one percent.
But we would like to know what happens to sales when we change our price but competitors don’t. The second regression has change in the log of the average price charged by our competitors (x2) next to x1. Here β1 would answer our question: average percentage change in our sales in months when we increase our price by one percent, but competitors don’t change their price. Suppose that in that regression, we estimate β̂1 = −3. That is, our sales tend to drop by 3 percent when we increase our price by 1 percent and our competitors don’t change their prices. Also suppose that the coefficient on x2 is positive, β̂2 = 3. That means that our sales tend to increase by 3 percent when our competitors increase their prices by 1 percent and our price doesn’t change.
As we’ll see, whether the answer to the two questions is the same will depend on whether x1 and x2 are related. To understand that relationship, let us introduce the regression of x2 on x1, called the x − x regression, where δ is the slope parameter:
x2E = γ + δx1
In our own price–competitor price example, δ would tell us how much the two prices tend to move together. In particular, it tells us about the average percentage change in competitor price in months when our own price increases by one percent. Let’s suppose that we estimate it to be δ̂ = 0.8: competitors tend to increase their price by 0.8 of a percent when our company increases its price by one percent. The two prices tend to move together, in the same direction.
To link the two original regressions, plug this x − x regression back into the multiple regression (this step is fine even though that may not be obvious; see Section 10.14):
yE = β0 + β1x1 + β2(γ + δx1) = β0 + β2γ + (β1 + β2δ)x1
Comparing this with the simple regression yE = α + βx1 shows that the simple-regression slope must be β = β1 + β2δ. Thus, with regard to x1, the slope coefficients in the simple regression (β) and the multiple regression (β1) differ:
β − β1 = δβ2
In words: the slope of x1 in the simple regression differs from its slope in the multiple regression by the slope of x1 in the x − x regression (δ) times the slope of the other explanatory variable, x2, in the multiple regression (β2).
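For reference, here is the same derivation written compactly in standard notation (a restatement of the steps above, with the superscript E denoting average y):

$$
\begin{aligned}
\text{multiple regression:} \quad & y^{E} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
x\text{-}x\ \text{regression:} \quad & x_2^{E} = \gamma + \delta x_1 \\
\text{after substitution:} \quad & y^{E} = (\beta_0 + \beta_2 \gamma) + (\beta_1 + \beta_2 \delta)\, x_1 \\
\text{simple regression:} \quad & y^{E} = \alpha + \beta x_1 \\
\text{matching the slopes on } x_1: \quad & \beta = \beta_1 + \beta_2 \delta \;\;\Rightarrow\;\; \beta - \beta_1 = \delta \beta_2
\end{aligned}
$$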
This difference is called the omitted variable bias. If we are interested in the coefficient on x1 with x2 in the regression too, it’s the second regression that we need; the first regression is incorrect because it omits x2. Thus, the results from that first regression are biased, and the bias is caused by omitting x2. We will discuss omitted variable bias in detail when discussing causal effects in Chapter 21, Section 21.3.
In our sales–own price–competitors’ price example, we had that β̂ = −0.5 and β̂1 = −3, so that β̂ − β̂1 = −0.5 − (−3) = +2.5. In this case our simple regression gives a biased, overestimated slope on x1 compared to the multiple regression. Recall that we had δ̂ = 0.8 and β̂2 = 3. Their product is approximately 2.5 (2.4, but that difference is due to rounding). This overestimation is the result of two things: a positive association between the two price changes (δ) and a positive association between competitor price and our own sales (β2).
In general, the slope coefficient on x1 in the two regressions is different unless x1 and x2 are uncorrelated (δ = 0) or the coefficient on x2 is zero in the multiple regression (β2 = 0). The slope in the simple regression is larger if x2 and x1 are positively correlated and β2 is positive (or they are negatively correlated and β2 is negative).
The intuition is the following. In the simple regression yE = α + βx1, we compare observations that are different in x1 without considering whether they are different in x2. If x1 and x2 are uncorrelated this does not matter. In this case observations that are different in x1 are, on average, the same in x2, and, symmetrically, observations that are different in x2 are, on average, the same in x1. Thus the extra step we take with the multiple regression to compare observations that are different in x1 but similar in x2 does not matter here.
If, however, x1 and x2 are correlated, comparing observations with or without the same x2 value makes a difference. If they are positively correlated, observations with higher x2 tend to have higher x1. In the simple regression we ignore differences in x2 and compare observations with different values of x1. But higher x1 values mean higher x2 values, too. Corresponding differences in y may be due to differences in x1 but also due to differences in x2.
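The decomposition can also be verified numerically. Below is a minimal simulation sketch (not from the textbook) that generates data roughly matching the pricing example above, with true β1 = −3, β2 = 3, and an x − x slope of about δ = 0.8, and checks that the gap between the simple and multiple regression slopes equals δ × β2:

```r
# Simulate data in which x2 (competitor price change) co-moves with x1 (own price change)
set.seed(123)
n  <- 10000
x1 <- rnorm(n)                              # own log price change
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)         # competitor log price change; delta = 0.8
y  <- -3 * x1 + 3 * x2 + rnorm(n)           # log sales change; beta1 = -3, beta2 = 3

beta_simple   <- coef(lm(y ~ x1))["x1"]        # slope in the simple regression (beta)
beta_multiple <- coef(lm(y ~ x1 + x2))["x1"]   # slope in the multiple regression (beta1)
delta         <- coef(lm(x2 ~ x1))["x1"]       # slope in the x-x regression (delta)
beta2         <- coef(lm(y ~ x1 + x2))["x2"]   # slope of x2 in the multiple regression

beta_simple - beta_multiple   # difference between the two slopes ...
delta * beta2                 # ... is close to delta * beta2 (up to sampling noise)
```

With these true parameter values, the simple-regression slope comes out near β1 + β2δ = −0.6, and the difference between the two estimated slopes is close to δ × β2 = 2.4.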
Key Takeaway on Omitted Variable Bias
When we omit an important variable from our regression:
- The coefficient we estimate may be very different from the “true” effect
- The direction and magnitude of the bias depend on how the omitted variable relates to both y and the included x variables
- Multiple regression helps us get closer to the right answer by including important covariates
Prompt: “Give me a real-world example (different from the textbook) where omitted variable bias would be a serious problem. Explain what the omitted variable is, what direction the bias would go, and why.”
Multiple linear regression
We continue investigating patterns in earnings, gender, and age. The data is the same cps-earnings data that we used earlier in Chapter 9, Section 9.2: it is a representative sample of all people of age 15 to 85 in the U.S.A. in 2014.
Compared to Chapter 9, Section 9.2, where we focused on a single occupation, we broaden the scope of our investigation here to all employees with a graduate degree – that is, a degree higher than a 4-year college degree: these include professional, master’s, and doctoral degrees.
We use data on people of age 24 to 65 (to reflect the typical working age of people with these kinds of graduate degrees). We excluded the self-employed (their earnings are difficult to measure) and included those who reported 20 hours or more as their usual weekly time worked. We have 18,241 observations.
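A sketch of how these sample restrictions might be coded, assuming a raw data frame `cps_raw` and hypothetical variable names (`grad_degree`, `self_employed`, `uhours`) that may differ from those in the actual cps-earnings files:

```r
library(dplyr)

# Hypothetical variable names; adjust to the actual cps-earnings column names and codes
sample_grad <- cps_raw %>%
  filter(grad_degree == 1,        # professional, master's, or doctoral degree
         age >= 24, age <= 65,    # typical working age for these degrees
         self_employed == 0,      # exclude the self-employed
         uhours >= 20)            # usual weekly hours worked: 20 or more
```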
The dependent variable is log hourly earnings ln w. Table 10.1 shows the results from three regressions. (1) is a simple regression of ln w on a binary female variable. (2) is a multiple regression that includes age as well, in a linear fashion. (3) is a simple regression with age as the dependent variable and the female binary variable as the explanatory variable.
# R Code to replicate Table 10.1
library(estimatr)
library(modelsummary)

# Load data
data <- read.csv("cps_earnings_grad.csv")

# Estimate three models
model1 <- lm_robust(lnearnings ~ female, data = data)
model2 <- lm_robust(lnearnings ~ female + age, data = data)
model3 <- lm_robust(age ~ female, data = data)

# Display results
modelsummary(list(model1, model2, model3),
             stars = c('***' = 0.01, '**' = 0.05, '*' = 0.1),
             gof_omit = "IC|Log|F|RMSE",
             coef_rename = c("female" = "Female", "age" = "Age"))
Table 10.1: Gender differences in earnings – log earnings and gender
| Variable | (1) ln w | (2) ln w | (3) age |
|---|---|---|---|
| female | -0.195*** | -0.185*** | -1.484*** |
|  | (0.008) | (0.008) | (0.159) |
| age |  | 0.007*** |  |
|  |  | (0.000) |  |
| Constant | 3.514*** | 3.198*** | 44.630*** |
|  | (0.006) | (0.018) | (0.116) |
| Observations | 18,241 | 18,241 | 18,241 |
| R-squared | 0.028 | 0.046 | 0.005 |
Note: All employees with a graduate degree. Robust standard error estimates in parentheses. *** p<0.01, ** p<0.05, * p<0.1
Source: cps-earnings dataset. 2014, U.S.A.
According to column (1), women in this sample earn 19.5 log points less than men, on average; that corresponds to women earning about 18 percent less (or, equivalently, men earning about 21 percent more). Column (2) suggests that when we compare employees of the same age, women in this sample earn 18.5 log points (around 17 percent) less than men, on average. This is a slightly smaller gender difference than in column (1). While the log approximation is not perfect at these magnitudes, from now on, we will ignore the difference between log units and percent. For example, we will interpret a 0.195 coefficient as a 19.5 percent difference.
The estimated coefficients differ, and we know where the difference should come from: average difference in age. Let’s use the formula for the difference between the coefficient on female in the simple regression and in the multiple regression. Their difference is −0.195 − (−0.185) = −0.01. This should be equal to the product of the coefficient of female in the regression of age on female (our column (3)) and the coefficient on age in column (2): −1.48 × 0.007 ≈ −0.01. These two are indeed equal.
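If you have estimated the three models with the code block above, the same check can be done directly from the fitted objects (a quick sketch; coefficient names follow the formulas used there):

```r
# Difference between the female coefficients in the simple and multiple regressions
coef(model1)["female"] - coef(model2)["female"]

# Omitted-variable-bias formula: (female slope in the age regression) x (age slope)
coef(model3)["female"] * coef(model2)["age"]
```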
Intuitively, we can see that women of the same age have a slightly smaller earnings disadvantage in this data because they are somewhat younger, on average, and employees who are younger tend to earn less. Part of the earnings disadvantage of women is thus due to the fact that they are younger. This is a small part, though: around one percentage point of the 19.5% difference, which is a 5% share of the entire difference.
But why are women employees younger, on average, in the data? It’s because there are fewer female employees with graduate degrees over age 45 than below. Figure 10.1 shows two density plots overlaid: the age distributions of male and female employees with graduate degrees. There are relatively few below age 30. From age 30 and up, the age distribution is close to uniform for men. But not for women: the proportion of female employees with graduate degrees drops above age 45, and again above age 55.
# R Code to create Figure 10.1
library(ggplot2)

ggplot(data, aes(x = age,
                 fill = factor(female, labels = c("Male", "Female")))) +
  geom_density(alpha = 0.6) +
  scale_fill_manual(values = c("#20C997", "#D63384"), name = "") +
  labs(x = "Age (years)",
       y = "Density",
       title = "Age Distribution by Gender",
       subtitle = "Employees with Graduate Degrees") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top",
        plot.title = element_text(face = "bold"))
In principle, this could be due to two things: either there are fewer women with graduate degrees in the 45+ generation than among the younger ones, or fewer of them are employed (i.e., employed for 20 hours or more for pay, which is the criterion to be in our subsample). Further investigation reveals that it is the former: women are less likely to have a graduate degree if they were born before 1970 (those 45+ in 2014) in the U.S.A. The proportion of women working for pay for more than 20 hours is very similar among those below age 45 and above.
Explore conditional means by variable with our interactive dashboard. Adjust variables and watch the graph change in real time.
Prompt: “Looking at Table 10.1, explain why the coefficient on ‘female’ changes from -0.195 in column (1) to -0.185 in column (2). Use the omitted variable bias formula to verify this difference mathematically.”
🔗 Explore Further
- Interactive Dashboard: Visualize multiple regression concepts and omitted variable bias (Open Dashboard)
- Run the Analysis: Replicate all regressions and create all figures yourself (Open in GitHub Codespace)
- Download the Data: cps-earnings dataset (Get Data)
<a href="index.html">← Back to Chapter 10 Index</a>
<a href="02-inference.html">Next: Page 2 - Statistical Inference →</a>