Chapter 10 Glossary
Key Terms and Concepts in Multiple Linear Regression
Quick reference for all key terms and concepts
- Alphabetical order – Terms are organized A-Z for easy lookup
- Bold terms – Important concepts from the chapter
- One-sentence definitions – Clear, concise explanations
- Cross-references – Links to related terms
- Section references – Where to find detailed explanations in the chapter
Tip: Use Ctrl+F (Cmd+F on Mac) to search for specific terms
A
Average Treatment Effect (ATE)
The average effect of a treatment across all units in a causal analysis framework: the mean difference between each unit's outcome with and without the treatment.
See: Section 10.11
B
Bad Conditioning Variable (Bad Control Variable)
A variable that should not be included in a regression as a covariate because it is part of the causal mechanism or affected by the treatment variable, leading to biased estimates of causal effects.
See: Section 10.11
Binary Variable
A variable that takes only two values, typically coded as 0 and 1, used to represent categories or groups in regression analysis.
See: Dummy Variable
C
Case Study
A detailed empirical analysis using real data to illustrate concepts and methods; this chapter features seven case studies (A1-A6 on gender earnings, B1 on hotels).
See: All sections
Categorical Variable
A qualitative variable that takes values from a finite set of categories (e.g., education level, occupation, industry), entered in regression using dummy variables.
See: Section 10.9
Ceteris Paribus
Latin phrase meaning “all else equal,” referring to holding all other variables constant when examining the effect of one variable on another.
See: Section 10.11
Coefficient
A numerical parameter in a regression equation that measures the relationship between an explanatory variable and the dependent variable.
See: Sections 10.2, 10.7
Conditional Difference
The difference in average y between observations with different values of x₁ but the same values of other explanatory variables (x₂, x₃, etc.).
See: Section 10.4
Confidence Interval (CI)
A range of values that, with a specified probability (typically 95%), contains the true population parameter being estimated.
See: Section 10.5
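For instance (illustrative numbers, not from the chapter): with β̂ = 0.10 and SE(β̂) = 0.04, the 95% CI is 0.10 ± 1.96 × 0.04 ≈ [0.02, 0.18].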
Confounder
A variable that is correlated with both the dependent variable and an explanatory variable of interest, potentially biasing estimates if omitted from the regression.
See: Section 10.4
Controlled Difference
Same as conditional difference; the difference in y by x₁ while controlling for other variables.
See: Section 10.4
Covariate
An explanatory variable included in a multiple regression to control for its effects, typically when the focus is on a different explanatory variable.
See: Section 10.4
D
Degrees of Freedom
The number of observations minus the number of parameters estimated, relevant for calculating standard errors and test statistics (n - k - 1 in multiple regression with k explanatory variables).
See: Section 10.5
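For instance: with n = 1,000 observations and k = 3 explanatory variables, the degrees of freedom are 1,000 − 3 − 1 = 996.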
Dependent Variable
The outcome variable (y) being explained or predicted in a regression; also called the target variable or left-hand-side variable.
See: All sections
Dummy Variable
A binary (0/1) variable used to represent categories of a qualitative variable in regression analysis.
See: Section 10.9
Dummy Variable Trap
The problem of perfect collinearity that arises when all categories of a qualitative variable are included as dummy variables, requiring one category to be omitted as the reference.
See: Section 10.9
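A minimal sketch of avoiding the trap in Python with pandas (the occupation categories here are made up for illustration):

```python
import pandas as pd

# Hypothetical qualitative variable with three categories.
df = pd.DataFrame({"occupation": ["admin", "sales", "tech", "sales"]})

# drop_first=True omits one category as the reference category,
# avoiding perfect collinearity with the intercept (the dummy trap).
dummies = pd.get_dummies(df["occupation"], prefix="occ", drop_first=True)
print(dummies.columns.tolist())  # ['occ_sales', 'occ_tech']; 'admin' is the reference
```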
E
Explanatory Variable
A variable (x) used to explain variation in the dependent variable; also called independent variable, predictor, or right-hand-side variable.
See: All sections
Expected Value
The average or mean value of y conditional on specific values of the explanatory variables, denoted as y^E.
See: Sections 10.2, 10.7
F
F-statistic
A test statistic used in hypothesis testing for joint hypotheses about multiple regression coefficients.
See: Section 10.6
F-test
A hypothesis test based on the F-statistic, used to test whether multiple coefficients are simultaneously equal to zero or satisfy other joint restrictions.
See: Section 10.6
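A minimal sketch of an F-test in Python with statsmodels, on simulated data (all variable names are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; x2 has no true effect on y.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.5 * df["x1"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=df).fit()

# Joint null hypothesis: beta_1 = 0 and beta_2 = 0.
print(model.f_test("x1 = 0, x2 = 0"))
# The global F-test of all slopes is also reported directly:
print(model.fvalue, model.f_pvalue)
```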
Factor Variable
Another term for a categorical or qualitative variable.
See: Section 10.9
Fitted Value
The predicted value of y from a regression equation, denoted as ŷ, calculated by plugging observed x values into the estimated regression equation.
See: Section 10.12
Functional Form
The mathematical specification of how explanatory variables enter the regression, including whether they are linear, quadratic, logarithmic, or in other transformations.
See: Section 10.8
G
Global F-test
The F-test of the null hypothesis that all slope coefficients in a regression are zero, testing whether the regression as a whole explains any variation in y.
See: Section 10.6
H
Heteroskedasticity
A situation where the variance of regression residuals differs across observations, violating the homoskedasticity assumption of simple standard error formulas.
See: Section 10.5
Homoskedasticity
A situation where the variance of regression residuals is constant across all observations, an assumption required for simple standard error formulas to be valid.
See: Section 10.5
Hypothesis Test
A statistical procedure for determining whether data provide evidence against a null hypothesis, typically using a t-statistic or p-value.
See: Section 10.6
I
Interaction Term
A variable created by multiplying two explanatory variables, included in regression to allow the effect of one variable on y to differ by values of another variable.
See: Section 10.10
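A minimal sketch in Python with statsmodels, on simulated data (variable names echo the chapter's gender-earnings theme, but the numbers are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; variable names are illustrative placeholders.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"age": rng.uniform(25, 65, n),
                   "female": rng.integers(0, 2, n)})
df["lnearn"] = (3 + 0.02 * df["age"] + 0.1 * df["female"]
                - 0.005 * df["age"] * df["female"] + rng.normal(0, 0.3, n))

# age:female is the interaction term; it lets the slope on age differ
# by group ("age * female" is shorthand for age + female + age:female).
model = smf.ols("lnearn ~ age + female + age:female", data=df).fit()
print(model.params)
```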
Intercept
The constant term (β₀) in a regression equation, representing the expected value of y when all explanatory variables equal zero.
See: Sections 10.2, 10.7
J
Joint Hypothesis
A hypothesis that involves multiple parameters simultaneously, such as testing whether several slope coefficients are all zero.
See: Section 10.6
L
Left-out Category
Same as reference category; the category of a qualitative variable not represented by a dummy variable in regression.
See: Section 10.9
Linear Regression
A regression model where the dependent variable is a linear function of the parameters (coefficients), though explanatory variables can be transformed nonlinearly.
See: All sections
Log-linear Model
A regression with a log-transformed dependent variable, whose coefficients can be interpreted approximately as percentage differences.
See: Case Studies A1-A6
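For instance (illustrative numbers): a coefficient of −0.15 on a female dummy in a log-earnings regression means women earn roughly 15% less; the exact difference is 100 × (e^(−0.15) − 1) ≈ −13.9%.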
M
Moderator Variable
In medical and social science terminology, a variable that affects the strength or direction of the relationship between an explanatory variable and the dependent variable.
See: Section 10.10
Multicollinearity
High but imperfect correlation among explanatory variables, leading to large standard errors and imprecise coefficient estimates.
See: Section 10.5
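One common diagnostic, not specific to this chapter, is the variance inflation factor (VIF); a sketch in Python with statsmodels on simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two highly (but not perfectly) correlated regressors.
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

# VIFs far above ~10 are a common rough signal of multicollinearity.
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```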
Multiple Linear Regression
A regression model with more than one explanatory variable: y^E = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ.
See: All sections
O
Observational Data
Data generated from non-experimental settings where the researcher does not control assignment to treatment, making causal inference more challenging than with experimental data.
See: Section 10.11
Omitted Variable
A variable that affects y and is correlated with included explanatory variables but is not included in the regression, potentially causing omitted variable bias.
See: Section 10.3
Omitted Variable Bias (OVB)
The bias in estimated coefficients that arises when a relevant variable is omitted from a regression; the bias equals δβ₂, where δ is the slope in the x-x regression (of the omitted x₂ on the included x₁) and β₂ is the coefficient on the omitted variable in the multiple regression.
See: Section 10.3
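The formula can be restated compactly; a LaTeX sketch using the notation of the related entries (Simple Linear Regression, X-X Regression):

```latex
\begin{aligned}
y^{E} &= \alpha + \beta x_1                     && \text{(simple regression)} \\
y^{E} &= \beta_0 + \beta_1 x_1 + \beta_2 x_2    && \text{(multiple regression)} \\
x_2^{E} &= \delta_0 + \delta x_1                && \text{(x-x regression)} \\
\beta &= \beta_1 + \delta\beta_2                && \Rightarrow\ \text{OVB} = \beta - \beta_1 = \delta\beta_2
\end{aligned}
```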
Overfitting
Including too many explanatory variables or too flexible functional forms in a regression, leading to excellent fit in the sample but poor prediction for new observations.
See: Section 10.12
P
P-value
The probability of observing data as extreme as what was observed if the null hypothesis were true; small p-values (typically < 0.05) provide evidence against the null hypothesis.
See: Section 10.6
Perfect Collinearity
A situation where explanatory variables are exact linear functions of each other, making it impossible to estimate regression coefficients.
See: Section 10.5
Piecewise Linear Spline
A nonlinear functional form created by splitting the range of a variable into segments and allowing different linear slopes in each segment.
See: Section 10.8
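A minimal sketch in Python, building the spline term by hand for a single knot at x = 5 (simulated data; knot chosen for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a kink at x = 5.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 400)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5)) + rng.normal(0, 1, 400)
df = pd.DataFrame({"y": y, "x": x})
df["x_above"] = np.maximum(df["x"] - 5, 0)  # zero below the knot

# Slope below the knot: coefficient on x.
# Slope above the knot: coefficient on x plus coefficient on x_above.
model = smf.ols("y ~ x + x_above", data=df).fit()
print(model.params)
```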
Polynomial
A functional form that includes a variable and its powers (x, x², x³, etc.) to capture nonlinear patterns.
See: Section 10.8
Predicted Value
Same as fitted value; the value of y predicted by the regression equation for given values of explanatory variables.
See: Section 10.12
Prediction
Using a regression model to estimate the value of the dependent variable for observations with known values of explanatory variables but unknown y.
See: Section 10.12
Q
Qualitative Variable
A variable that takes values from a finite set of categories rather than numerical values; entered in regression using dummy variables.
See: Section 10.9
Quantitative Variable
A variable that takes numerical values, such as age, price, or distance.
See: Section 10.9
R
R-squared (R²)
The proportion of variation in y explained by the regression, ranging from 0 to 1; calculated as Var(ŷ)/Var(y) or 1 - Var(e)/Var(y).
See: Section 10.12
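A minimal sketch in Python with statsmodels, verifying the two variance formulas on simulated data:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data; the identity holds for any OLS fit with an intercept.
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
res = sm.OLS(y, sm.add_constant(x)).fit()

print(np.var(res.fittedvalues) / np.var(y))  # Var(y-hat) / Var(y)
print(1 - np.var(res.resid) / np.var(y))     # 1 - Var(e) / Var(y)
print(res.rsquared)                          # matches both
```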
Reference Category (Reference Group)
The category of a qualitative variable that is omitted when creating dummy variables, serving as the baseline for comparison in interpreting dummy variable coefficients.
See: Section 10.9
Regression Line
The line (or curve) representing the expected value of y as a function of x values in a regression model.
See: All sections
Residual
The difference between the observed value of y and its predicted value from the regression: e = y - ŷ.
See: Section 10.12
Right-hand-side Variable
Same as explanatory variable; a variable on the right side of the regression equation.
See: All sections
Robust Standard Error
A standard error formula that produces valid inference under heteroskedasticity, recommended as the default choice over simple standard errors.
See: Section 10.5
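A minimal sketch in Python with statsmodels (cov_type="HC1" is one common robust option; the data are simulated to be heteroskedastic):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated heteroskedastic data: the error spread grows with x.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(0.1, 2, n)
df = pd.DataFrame({"x": x})
df["y"] = 1 + 0.5 * x + rng.normal(scale=x, size=n)

simple = smf.ols("y ~ x", data=df).fit()                 # classical SEs
robust = smf.ols("y ~ x", data=df).fit(cov_type="HC1")   # robust SEs
print(simple.bse["x"], robust.bse["x"])
```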
S
Significance Level
The probability threshold (commonly 0.05 or 5%) used to determine whether to reject a null hypothesis based on p-values or test statistics.
See: Section 10.6
Simple Linear Regression
A regression model with only one explanatory variable: y^E = α + βx.
See: Section 10.3
Slope Coefficient
A parameter in a regression equation (β₁, β₂, etc.) measuring how much y changes, on average, when an explanatory variable increases by one unit.
See: Sections 10.2, 10.7
Standard Error (SE)
A measure of the uncertainty in a coefficient estimate, used to construct confidence intervals and test statistics.
See: Section 10.5
Statistical Inference
The process of using sample data to draw conclusions about population parameters, typically through confidence intervals and hypothesis tests.
See: Sections 10.5, 10.6
T
T-statistic
A test statistic calculated as the estimated coefficient divided by its standard error, used to test hypotheses about individual coefficients.
See: Section 10.6
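For instance (illustrative numbers): with β̂ = 0.10 and SE = 0.04, t = 0.10/0.04 = 2.5; since |t| > 1.96, H₀: β = 0 is rejected at the 5% level in large samples.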
Target Variable
Same as dependent variable; the variable being predicted or explained.
See: Section 10.12
U
Unconditional Difference
The difference in average y between observations with different values of x₁ without controlling for other variables.
See: Section 10.3
V
Variable Selection
The process of choosing which explanatory variables and functional forms to include in a regression model.
See: Sections 10.8, 10.11, 10.12
X
X-X Regression
A regression of one explanatory variable (x₂) on another explanatory variable (x₁), used to understand omitted variable bias.
See: Section 10.3
Y
Y-hat (ŷ)
Notation for the predicted or fitted value of y from a regression.
See: Section 10.12
Y-hat versus Y plot (ŷ vs. y plot)
A scatter plot with predicted values (ŷ) on the horizontal axis and actual values (y) on the vertical axis, used to visualize regression fit; the 45-degree line shows where predictions would be perfect.
See: Section 10.12
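A minimal sketch in Python with matplotlib (y and ŷ are simulated stand-ins; in practice use the actual outcome and model.fittedvalues):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical outcomes and predictions.
rng = np.random.default_rng(6)
y = rng.normal(size=200)
y_hat = y + rng.normal(scale=0.5, size=200)

fig, ax = plt.subplots()
ax.scatter(y_hat, y, alpha=0.5)   # predicted on the horizontal axis
lims = [min(y.min(), y_hat.min()), max(y.max(), y_hat.max())]
ax.plot(lims, lims)               # 45-degree line: perfect predictions
ax.set_xlabel("predicted value (y-hat)")
ax.set_ylabel("actual value (y)")
plt.show()
```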
Mathematical Notation
β (Beta)
Greek letter used to denote regression coefficients in population or theoretical models.
See: All sections
β̂ (Beta-hat)
Estimated regression coefficient from sample data.
See: All sections
e
Regression residual (e = y - ŷ).
See: Sections 10.5, 10.12
E (Superscript)
Denotes expected or average value, as in y^E for expected y.
See: All sections
n
Sample size (number of observations).
See: Section 10.5
k
Number of explanatory variables in a regression.
See: Section 10.5
Var[ ]
Variance of a variable.
See: Sections 10.5, 10.12
Std[ ]
Standard deviation of a variable.
See: Section 10.5
Common Abbreviations
CI
Confidence Interval
See: Section 10.5
OLS
Ordinary Least Squares (the standard method for estimating regression coefficients)
See: All sections
OVB
Omitted Variable Bias
See: Section 10.3
SE
Standard Error
See: Section 10.5
Dataset-Specific Terms
CPS (Current Population Survey)
The U.S. survey data used in Case Studies A1-A6 on gender earnings differences.
See: Case Studies A1-A6
Graduate Degree
In the CPS data, includes professional degrees (e.g., MD), master’s degrees (e.g., MA, MBA), and doctoral degrees (PhD), used to define the sample in Case Studies A1-A6.
See: Case Studies A1-A6
Log Earnings
Natural logarithm of earnings, used as the dependent variable in Case Studies A1-A6 to allow interpretation of coefficients as approximate percentage differences.
See: Case Studies A1-A6
Stars (Hotel Rating)
A categorical measure of hotel quality (3, 3.5, or 4 stars) used in Case Study B1.
See: Case Study B1
Customer Rating
Average customer review score for hotels, used as a continuous measure of quality in Case Study B1.
See: Case Study B1
Terms by Topic
Core Concepts:
Multiple Linear Regression, Coefficient, Conditional Difference, Covariate, Omitted Variable Bias
Statistical Inference:
Standard Error, Confidence Interval, Hypothesis Test, P-value, T-statistic, F-test, Robust Standard Error
Variables and Data:
Dummy Variable, Reference Category, Interaction Term, Categorical Variable, Binary Variable
Model Specification:
Functional Form, Polynomial, Piecewise Linear Spline, Multicollinearity, Perfect Collinearity
Applications:
Prediction, Residual, R-squared, Y-hat versus Y plot, Overfitting, Ceteris Paribus
Causality:
Confounder, Bad Conditioning Variable, Observational Data, Omitted Variable
If you can only memorize a few terms, focus on these:
- Multiple Linear Regression – The core method of this chapter
- Omitted Variable Bias – Why simple regression can be misleading
- Conditional/Controlled Difference – What multiple regression actually estimates
- Robust Standard Error – Essential for valid inference
- Interaction Term – Allows different slopes for different groups
- Dummy Variable – How to include categories in regression
- Ceteris Paribus – The ideal of “all else equal” comparison
- Covariate – Variables we control for in analysis
- R-squared – Measure of regression fit
- Multicollinearity – Why high correlation among x variables causes problems
Master these ten concepts and you'll understand most of what matters in multiple regression!
Need more detail? Each term's See reference points to the section where it's explained fully.