Chapter 10 Glossary
Key Terms and Concepts in Multiple Linear Regression
Quick reference for all key terms and concepts
- Alphabetical order – Terms are organized A-Z for easy lookup
- Bold terms – Important concepts from the chapter
- One-sentence definitions – Clear, concise explanations
- Cross-references – Links to related terms
- Section references – Where to find detailed explanations in the chapter
Tip: Use Ctrl+F (Cmd+F on Mac) to search for specific terms
A
Average Treatment Effect (ATE)
The average effect of a treatment across all units in a causal analysis framework: the mean difference between each unit's outcome with and without the treatment.
See: Section 10.11
B
Bad Conditioning Variable (Bad Control Variable)
A variable that should not be included in a regression as a covariate because it is part of the causal mechanism or affected by the treatment variable, leading to biased estimates of causal effects.
See: Section 10.11
Binary Variable
A variable that takes only two values, typically coded as 0 and 1, used to represent categories or groups in regression analysis.
See: Dummy Variable
C
Case Study
A detailed empirical analysis using real data to illustrate concepts and methods; this chapter features seven case studies (A1-A6 on gender earnings, B1 on hotels).
See: All sections
Categorical Variable
A qualitative variable that takes values from a finite set of categories (e.g., education level, occupation, industry), entered in regression using dummy variables.
See: Section 10.9
Ceteris Paribus
Latin phrase meaning “all else equal,” referring to holding all other variables constant when examining the effect of one variable on another.
See: Section 10.11
Coefficient
A numerical parameter in a regression equation that measures the relationship between an explanatory variable and the dependent variable.
See: Sections 10.2, 10.7
Conditional Difference
The difference in average y between observations with different values of x₁ but the same values of other explanatory variables (x₂, x₃, etc.).
See: Section 10.4
Confidence Interval (CI)
A range of values that, with a specified probability (typically 95%), contains the true population parameter being estimated.
See: Section 10.5
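For instance (illustrative numbers, not from the chapter): with β̂ = 0.10 and SE(β̂) = 0.04, the 95% CI is 0.10 ± 1.96 × 0.04 ≈ [0.02, 0.18].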
Confounder
A variable that is correlated with both the dependent variable and an explanatory variable of interest, potentially biasing estimates if omitted from the regression.
See: Section 10.4
Controlled Difference
Same as conditional difference; the difference in y by x₁ while controlling for other variables.
See: Section 10.4
Covariate
An explanatory variable included in a multiple regression to control for its effects, typically when the focus is on a different explanatory variable.
See: Section 10.4
D
Degrees of Freedom
The number of observations minus the number of parameters estimated, relevant for calculating standard errors and test statistics (n - k - 1 in multiple regression with k explanatory variables).
See: Section 10.5
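For instance: with n = 1,000 observations and k = 3 explanatory variables, the degrees of freedom are 1,000 − 3 − 1 = 996.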
Dependent Variable
The outcome variable (y) being explained or predicted in a regression; also called the target variable or left-hand-side variable.
See: All sections
Dummy Variable
A binary (0/1) variable used to represent categories of a qualitative variable in regression analysis.
See: Section 10.9
Dummy Variable Trap
The problem of perfect collinearity that arises when all categories of a qualitative variable are included as dummy variables, requiring one category to be omitted as the reference.
See: Section 10.9
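A minimal sketch of avoiding the trap in Python with pandas (the occupation categories here are made up for illustration):

```python
import pandas as pd

# Hypothetical qualitative variable with three categories.
df = pd.DataFrame({"occupation": ["admin", "sales", "tech", "sales"]})

# drop_first=True omits one category as the reference category,
# avoiding perfect collinearity with the intercept (the dummy trap).
dummies = pd.get_dummies(df["occupation"], prefix="occ", drop_first=True)
print(dummies.columns.tolist())  # ['occ_sales', 'occ_tech']; 'admin' is the reference
```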
E
Explanatory Variable
A variable (x) used to explain variation in the dependent variable; also called independent variable, predictor, or right-hand-side variable.
See: All sections
Expected Value
The average or mean value of y conditional on specific values of the explanatory variables, denoted as y^E.
See: Sections 10.2, 10.7
F
F-statistic
A test statistic used in hypothesis testing for joint hypotheses about multiple regression coefficients.
See: Section 10.6
F-test
A hypothesis test based on the F-statistic, used to test whether multiple coefficients are simultaneously equal to zero or satisfy other joint restrictions.
See: Section 10.6
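A minimal sketch of an F-test in Python with statsmodels, on simulated data (all variable names are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; x2 has no true effect on y.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.5 * df["x1"] + rng.normal(size=n)

model = smf.ols("y ~ x1 + x2", data=df).fit()

# Joint null hypothesis: beta_1 = 0 and beta_2 = 0.
print(model.f_test("x1 = 0, x2 = 0"))
# The global F-test of all slopes is also reported directly:
print(model.fvalue, model.f_pvalue)
```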
Factor Variable
Another term for a categorical or qualitative variable.
See: Section 10.9
Fitted Value
The predicted value of y from a regression equation, denoted as ŷ, calculated by plugging observed x values into the estimated regression equation.
See: Section 10.12
Functional Form
The mathematical specification of how explanatory variables enter the regression, including whether they are linear, quadratic, logarithmic, or in other transformations.
See: Section 10.8
G
Global F-test
The F-test of the null hypothesis that all slope coefficients in a regression are zero, testing whether the regression as a whole explains any variation in y.
See: Section 10.6
H
Heteroskedasticity
A situation where the variance of regression residuals differs across observations, violating the homoskedasticity assumption of simple standard error formulas.
See: Section 10.5
Homoskedasticity
A situation where the variance of regression residuals is constant across all observations, an assumption required for simple standard error formulas to be valid.
See: Section 10.5
Hypothesis Test
A statistical procedure for determining whether data provide evidence against a null hypothesis, typically using a t-statistic or p-value.
See: Section 10.6
I
Interaction Term
A variable created by multiplying two explanatory variables, included in regression to allow the effect of one variable on y to differ by values of another variable.
See: Section 10.10
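A minimal sketch in Python with statsmodels, on simulated data (variable names echo the chapter's gender-earnings theme, but the numbers are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data; variable names are illustrative placeholders.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"age": rng.uniform(25, 65, n),
                   "female": rng.integers(0, 2, n)})
df["lnearn"] = (3 + 0.02 * df["age"] + 0.1 * df["female"]
                - 0.005 * df["age"] * df["female"] + rng.normal(0, 0.3, n))

# age:female is the interaction term; it lets the slope on age differ
# by group ("age * female" is shorthand for age + female + age:female).
model = smf.ols("lnearn ~ age + female + age:female", data=df).fit()
print(model.params)
```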
Intercept
The constant term (β₀) in a regression equation, representing the expected value of y when all explanatory variables equal zero.
See: Sections 10.2, 10.7
J
Joint Hypothesis
A hypothesis that involves multiple parameters simultaneously, such as testing whether several slope coefficients are all zero.
See: Section 10.6
L
Left-out Category
Same as reference category; the category of a qualitative variable not represented by a dummy variable in regression.
See: Section 10.9
Linear Regression
A regression model where the dependent variable is a linear function of the parameters (coefficients), though explanatory variables can be transformed nonlinearly.
See: All sections
Log-linear Model
A regression with a log-transformed dependent variable, whose coefficients can be interpreted approximately as percentage differences.
See: Case Studies A1-A6
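For instance (illustrative numbers): a coefficient of −0.15 on a female dummy in a log-earnings regression means women earn roughly 15% less; the exact difference is 100 × (e^(−0.15) − 1) ≈ −13.9%.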
M
Moderator Variable
In medical and social science terminology, a variable that affects the strength or direction of the relationship between an explanatory variable and the dependent variable.
See: Section 10.10
Multicollinearity
High but imperfect correlation among explanatory variables, leading to large standard errors and imprecise coefficient estimates.
See: Section 10.5
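One common diagnostic, not specific to this chapter, is the variance inflation factor (VIF); a sketch in Python with statsmodels on simulated data:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two highly (but not perfectly) correlated regressors.
rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

# VIFs far above ~10 are a common rough signal of multicollinearity.
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```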
Multiple Linear Regression
A regression model with more than one explanatory variable: y^E = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ.
See: All sections
O
Observational Data
Data generated from non-experimental settings where the researcher does not control assignment to treatment, making causal inference more challenging than with experimental data.
See: Section 10.11
Omitted Variable
A variable that affects y and is correlated with included explanatory variables but is not included in the regression, potentially causing omitted variable bias.
See: Section 10.3
Omitted Variable Bias (OVB)
The bias in estimated coefficients that arises when a relevant variable is omitted from a regression; the bias equals δβ₂, where δ is the slope in the x-x regression (of the omitted x₂ on the included x₁) and β₂ is the coefficient on the omitted variable in the multiple regression.
See: Section 10.3
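The formula can be restated compactly; a LaTeX sketch using the notation of the related entries (Simple Linear Regression, X-X Regression):

```latex
\begin{aligned}
y^{E} &= \alpha + \beta x_1                     && \text{(simple regression)} \\
y^{E} &= \beta_0 + \beta_1 x_1 + \beta_2 x_2    && \text{(multiple regression)} \\
x_2^{E} &= \delta_0 + \delta x_1                && \text{(x-x regression)} \\
\beta &= \beta_1 + \delta\beta_2                && \Rightarrow\ \text{OVB} = \beta - \beta_1 = \delta\beta_2
\end{aligned}
```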
Overfitting
Including too many explanatory variables or too flexible functional forms in a regression, leading to excellent fit in the sample but poor prediction for new observations.
See: Section 10.12
P
P-value
The probability of observing data as extreme as what was observed if the null hypothesis were true; small p-values (typically < 0.05) provide evidence against the null hypothesis.
See: Section 10.6
Perfect Collinearity
A situation where explanatory variables are exact linear functions of each other, making it impossible to estimate regression coefficients.
See: Section 10.5
Piecewise Linear Spline
A nonlinear functional form created by splitting the range of a variable into segments and allowing different linear slopes in each segment.
See: Section 10.8
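A minimal sketch in Python, building the spline term by hand for a single knot at x = 5 (simulated data; knot chosen for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a kink at x = 5.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 400)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5)) + rng.normal(0, 1, 400)
df = pd.DataFrame({"y": y, "x": x})
df["x_above"] = np.maximum(df["x"] - 5, 0)  # zero below the knot

# Slope below the knot: coefficient on x.
# Slope above the knot: coefficient on x plus coefficient on x_above.
model = smf.ols("y ~ x + x_above", data=df).fit()
print(model.params)
```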
Polynomial
A functional form that includes a variable and its powers (x, x², x³, etc.) to capture nonlinear patterns.
See: Section 10.8
Predicted Value
Same as fitted value; the value of y predicted by the regression equation for given values of explanatory variables.
See: Section 10.12
Prediction
Using a regression model to estimate the value of the dependent variable for observations with known values of explanatory variables but unknown y.
See: Section 10.12
Q
Qualitative Variable
A variable that takes values from a finite set of categories rather than numerical values; entered in regression using dummy variables.
See: Section 10.9
Quantitative Variable
A variable that takes numerical values, such as age, price, or distance.
See: Section 10.9
R
R-squared (R²)
The proportion of variation in y explained by the regression, ranging from 0 to 1; calculated as Var(ŷ)/Var(y) or 1 - Var(e)/Var(y).
See: Section 10.12
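A minimal sketch in Python with statsmodels, verifying the two variance formulas on simulated data:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data; the identity holds for any OLS fit with an intercept.
rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)
res = sm.OLS(y, sm.add_constant(x)).fit()

print(np.var(res.fittedvalues) / np.var(y))  # Var(y-hat) / Var(y)
print(1 - np.var(res.resid) / np.var(y))     # 1 - Var(e) / Var(y)
print(res.rsquared)                          # matches both
```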
Reference Category (Reference Group)
The category of a qualitative variable that is omitted when creating dummy variables, serving as the baseline for comparison in interpreting dummy variable coefficients.
See: Section 10.9
Regression Line
The line (or curve) representing the expected value of y as a function of x values in a regression model.
See: All sections
Residual
The difference between the observed value of y and its predicted value from the regression: e = y - ŷ.
See: Section 10.12
Right-hand-side Variable
Same as explanatory variable; a variable on the right side of the regression equation.
See: All sections
Robust Standard Error
A standard error formula that produces valid inference under heteroskedasticity, recommended as the default choice over simple standard errors.
See: Section 10.5
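A minimal sketch in Python with statsmodels (cov_type="HC1" is one common robust option; the data are simulated to be heteroskedastic):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated heteroskedastic data: the error spread grows with x.
rng = np.random.default_rng(5)
n = 500
x = rng.uniform(0.1, 2, n)
df = pd.DataFrame({"x": x})
df["y"] = 1 + 0.5 * x + rng.normal(scale=x, size=n)

simple = smf.ols("y ~ x", data=df).fit()                 # classical SEs
robust = smf.ols("y ~ x", data=df).fit(cov_type="HC1")   # robust SEs
print(simple.bse["x"], robust.bse["x"])
```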
S
Significance Level
The probability threshold (commonly 0.05 or 5%) used to determine whether to reject a null hypothesis based on p-values or test statistics.
See: Section 10.6
Simple Linear Regression
A regression model with only one explanatory variable: y^E = α + βx.
See: Section 10.3
Slope Coefficient
A parameter in a regression equation (β₁, β₂, etc.) measuring how much y changes, on average, when an explanatory variable increases by one unit.
See: Sections 10.2, 10.7
Standard Error (SE)
A measure of the uncertainty in a coefficient estimate, used to construct confidence intervals and test statistics.
See: Section 10.5
Statistical Inference
The process of using sample data to draw conclusions about population parameters, typically through confidence intervals and hypothesis tests.
See: Sections 10.5, 10.6
T
T-statistic
A test statistic calculated as the estimated coefficient divided by its standard error, used to test hypotheses about individual coefficients.
See: Section 10.6
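For instance (illustrative numbers): with β̂ = 0.10 and SE = 0.04, t = 0.10/0.04 = 2.5; since |t| > 1.96, H₀: β = 0 is rejected at the 5% level in large samples.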
Target Variable
Same as dependent variable; the variable being predicted or explained.
See: Section 10.12
U
Unconditional Difference
The difference in average y between observations with different values of x₁ without controlling for other variables.
See: Section 10.3
V
Variable Selection
The process of choosing which explanatory variables and functional forms to include in a regression model.
See: Sections 10.8, 10.11, 10.12
X
X-X Regression
A regression of one explanatory variable (x₂) on another explanatory variable (x₁), used to understand omitted variable bias.
See: Section 10.3
Y
Y-hat (ŷ)
Notation for the predicted or fitted value of y from a regression.
See: Section 10.12
Y-hat versus Y plot (ŷ vs. y plot)
A scatter plot with predicted values (ŷ) on the horizontal axis and actual values (y) on the vertical axis, used to visualize regression fit; the 45-degree line shows where predictions would be perfect.
See: Section 10.12
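A minimal sketch in Python with matplotlib (y and ŷ are simulated stand-ins; in practice use the actual outcome and model.fittedvalues):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical outcomes and predictions.
rng = np.random.default_rng(6)
y = rng.normal(size=200)
y_hat = y + rng.normal(scale=0.5, size=200)

fig, ax = plt.subplots()
ax.scatter(y_hat, y, alpha=0.5)   # predicted on the horizontal axis
lims = [min(y.min(), y_hat.min()), max(y.max(), y_hat.max())]
ax.plot(lims, lims)               # 45-degree line: perfect predictions
ax.set_xlabel("predicted value (y-hat)")
ax.set_ylabel("actual value (y)")
plt.show()
```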
Mathematical Notation
β (Beta)
Greek letter used to denote regression coefficients in population or theoretical models.
See: All sections
β̂ (Beta-hat)
Estimated regression coefficient from sample data.
See: All sections
e
Regression residual (e = y - ŷ).
See: Sections 10.5, 10.12
E (Superscript)
Denotes expected or average value, as in y^E for expected y.
See: All sections
n
Sample size (number of observations).
See: Section 10.5
k
Number of explanatory variables in a regression.
See: Section 10.5
Var[ ]
Variance of a variable.
See: Sections 10.5, 10.12
Std[ ]
Standard deviation of a variable.
See: Section 10.5
Common Abbreviations
CI
Confidence Interval
See: Section 10.5
OLS
Ordinary Least Squares (the standard method for estimating regression coefficients)
See: All sections
OVB
Omitted Variable Bias
See: Section 10.3
SE
Standard Error
See: Section 10.5
Dataset-Specific Terms
CPS (Current Population Survey)
The U.S. survey data used in Case Studies A1-A6 on gender earnings differences.
See: Case Studies A1-A6
Graduate Degree
In the CPS data, includes professional degrees (e.g., MD), master’s degrees (e.g., MA, MBA), and doctoral degrees (PhD), used to define the sample in Case Studies A1-A6.
See: Case Studies A1-A6
Log Earnings
Natural logarithm of earnings, used as the dependent variable in Case Studies A1-A6 to allow interpretation of coefficients as approximate percentage differences.
See: Case Studies A1-A6
Stars (Hotel Rating)
A categorical measure of hotel quality (3, 3.5, or 4 stars) used in Case Study B1.
See: Case Study B1
Customer Rating
Average customer review score for hotels, used as a continuous measure of quality in Case Study B1.
See: Case Study B1
Terms by Topic
Core Concepts:
Multiple Linear Regression, Coefficient, Conditional Difference, Covariate, Omitted Variable Bias
Statistical Inference:
Standard Error, Confidence Interval, Hypothesis Test, P-value, T-statistic, F-test, Robust Standard Error
Variables and Data:
Dummy Variable, Reference Category, Interaction Term, Categorical Variable, Binary Variable
Model Specification:
Functional Form, Polynomial, Piecewise Linear Spline, Multicollinearity, Perfect Collinearity
Applications:
Prediction, Residual, R-squared, Y-hat versus Y plot, Overfitting, Ceteris Paribus
Causality:
Confounder, Bad Conditioning Variable, Observational Data, Omitted Variable
If you can only memorize a few terms, focus on these:
- Multiple Linear Regression – The core method of this chapter
- Omitted Variable Bias – Why simple regression can be misleading
- Conditional/Controlled Difference – What multiple regression actually estimates
- Robust Standard Error – Essential for valid inference
- Interaction Term – Allows different slopes for different groups
- Dummy Variable – How to include categories in regression
- Ceteris Paribus – The ideal of “all else equal” comparison
- Covariate – Variables we control for in analysis
- R-squared – Measure of regression fit
- Multicollinearity – Why high correlation among x variables causes problems
Master these ten concepts and you'll understand most of what matters in multiple regression!
Need more detail? Each term's See reference points to the section where it's explained fully.