Downloads
Download the front matter with a detailed list of chapters and sections: CONTENTS
Download a sample chapter from the near-final proofs: Chapter 10 Multiple Linear Regression
Download a short summary of why to use this book: TEXTBOOK SUMMARY
PART I: DATA EXPLORATION
Chapter 01: Origins of Data
This chapter is about data collection and data quality. The chapter starts by introducing key concepts of data. It then describes the most important methods of data collection used in business, economics, and policy analysis, such as web scraping, using administrative sources, and conducting surveys. We introduce aspects of data quality, such as validity and reliability of variables and coverage of observations. We discuss how to assess data quality and how to link it to the way the data was collected. We devote a section to Big Data to understand what it is and how it may differ from more traditional data. This chapter also covers sampling, including random sampling and potential biases due to noncoverage and nonresponse, as well as ethical issues and some good practices in data collection.
Chapter 02: Preparing Data for Analysis
This chapter is about preparing data for analysis: how to start working with data. First, we clarify some concepts: types of variables, types of observations, data tables, and datasets. We then turn to the concept of tidy data: data tables with the same kinds of observations. We discuss potential issues with observations and variables, and how to deal with those issues. We describe good practices for the process of data cleaning and discuss the additional challenges of working with Big Data.
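As a minimal illustration of the tidy-data idea (a sketch with made-up numbers, not code from the book), here is how a wide table with one column per year can be reshaped in Python so that each row is one country-year observation:

```python
import pandas as pd

# A small wide table: one row per country, one GDP column per year (made-up numbers)
wide = pd.DataFrame({
    "country": ["Austria", "Belgium"],
    "gdp_2018": [386, 460],
    "gdp_2019": [397, 478],
})

# Reshape into a tidy long table: each row is one country-year observation
tidy = wide.melt(id_vars="country", var_name="year", value_name="gdp")
tidy["year"] = tidy["year"].str.replace("gdp_", "", regex=False).astype(int)
print(tidy)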
Chapter 03: Exploratory Data Analysis
The chapter starts with why exploratory data analysis is important. It then discusses some basic concepts such as frequencies, probabilities, distributions, and extreme values. It includes guidelines for producing informative graphs and tables for presentation and describes the most important summary statistics. The chapter and its appendix also cover some of the most important theoretical distributions and their uses.
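A quick sketch of the kind of summaries the chapter covers, with made-up numbers rather than the book's case study data:

```python
import pandas as pd

# A hypothetical price variable with one unusually large value
prices = pd.Series([95, 120, 80, 150, 110, 70, 400, 105, 98, 130], name="price")

print(prices.describe())                       # mean, standard deviation, quartiles, min, max
print(prices.quantile([0.05, 0.5, 0.95]))      # selected percentiles
print(prices[prices > prices.quantile(0.95)])  # candidate extreme values to inspect
# prices.hist() would draw a histogram of the distribution
```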
Chapter 04: Comparison and Correlation
Most methods of data analysis are based on comparing values of one variable, y, across observations with different values of another variable, x, or more such variables. This chapter introduces simple methods of such comparison. We start by emphasizing that we need to define both y and x precisely for meaningful comparisons, and we need to measure them well. We introduce conditioning, and we discuss conditional comparisons, or further conditioning, which takes values of other variables into account as well. We discuss conditional probabilities, conditional distributions, and conditional means. We introduce the related concepts of dependence and mean-dependence, and we introduce covariance and correlation. Throughout the chapter, we discuss informative visualization of the various kinds of comparisons.
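A minimal Python sketch of conditional means and correlation, with hypothetical hotel price and distance data (the variable names are placeholders, not the book's dataset):

```python
import pandas as pd

# Hypothetical data: hotel price (y), distance to the center and star rating (x variables)
hotels = pd.DataFrame({
    "price":    [95, 120, 80, 150, 110, 70],
    "distance": [1.2, 0.5, 2.5, 0.3, 1.0, 3.0],
    "stars":    [3, 4, 3, 4, 3, 2],
})

# Conditional mean: average price by star rating
print(hotels.groupby("stars")["price"].mean())

# Covariance and correlation between price and distance
print(hotels[["price", "distance"]].cov())
print(hotels["price"].corr(hotels["distance"]))
```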
Chapter 05: Generalizing from Data
This chapter introduces the conceptual issues with generalizing results from our data to the general pattern we care about and methods of statistical inference. We start by discussing the two steps of the process of generalization: generalizing from the data to the general pattern our data represents, such as a population, and assessing how the general pattern that is relevant for the situation we care about relates to the general pattern our data represents. The first task is statistical inference, the second is assessing external validity. We introduce the conceptual framework of repeated samples and estimation. We introduce the standard error and the confidence interval that quantify the uncertainty of this step of generalization. We introduce two methods to estimate the standard error, the bootstrap and the standard error formula. Discussing external validity, we acknowledge that there are no readily available methods to quantify the uncertainty of this step of generalization, but we discuss how we can think about it and how we may use the results of additional data analysis to assess it.
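A minimal sketch, on simulated data, of the two standard error estimates mentioned above and the confidence interval built from one of them:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=100, scale=15, size=500)   # a simulated sample

# Standard error formula for the mean: sd / sqrt(n)
se_formula = x.std(ddof=1) / np.sqrt(len(x))

# Bootstrap: re-estimate the mean on many samples drawn with replacement
boot_means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(1000)]
se_bootstrap = np.std(boot_means, ddof=1)

print(se_formula, se_bootstrap)   # the two estimates should be close

# 95% confidence interval using the formula standard error
print(x.mean() - 1.96 * se_formula, x.mean() + 1.96 * se_formula)
```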
Chapter 06: Testing Hypotheses
This chapter introduces the logic and practice of testing hypotheses. We describe the steps of hypothesis testing and discuss two alternative ways to carry it out: one with the help of a test statistic and a critical value, and another one with the help of a p-value. We discuss how decision rules are derived from our desire to control the likelihood of making erroneous decisions (false positives and false negatives), and how significance levels, power, and p-values are related to the likelihood of those errors. We focus on testing hypotheses about averages, but, as we show in one of our case studies, this focus is less restrictive than it may appear. The chapter covers one-sided versus two-sided alternatives, issues with testing multiple hypotheses, the perils of p-hacking, and some issues with testing on Big Data.
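A minimal sketch of both routes to a decision, on simulated data with a made-up null hypothesis (H0: the mean is 100):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=103, scale=15, size=200)   # simulated sample

result = stats.ttest_1samp(x, popmean=100)          # two-sided alternative
critical_value = stats.t.ppf(0.975, df=len(x) - 1)  # 5% significance level

print(f"t = {result.statistic:.2f}, critical value = {critical_value:.2f}, p = {result.pvalue:.3f}")
# Reject H0 if |t| > critical value, or equivalently if p < 0.05
```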
PART II: REGRESSION ANALYSIS
Chapter 07: Simple Regression
In this chapter, we introduce simple nonparametric regression and simple linear regression. We discuss nonparametric regressions such as bin scatters, step functions, and lowess regressions and their visualization. The larger part of the chapter discusses simple linear regression in detail. We introduce the regression equation, how its coefficients are estimated in actual data by the method of ordinary least squares (OLS), and we emphasize how to interpret the coefficients. We introduce the concepts of predicted value, residual, and goodness of fit, and we discuss the relationship between regression and correlation. We end with a note on the relationship between causation and regression.
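A minimal sketch of simple linear regression estimated by OLS on simulated data (statsmodels is one option; the variable names are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"distance": rng.uniform(0, 5, 200)})
df["price"] = 120 - 15 * df["distance"] + rng.normal(0, 10, 200)

reg = smf.ols("price ~ distance", data=df).fit()
print(reg.params)      # intercept and slope: the difference in average price per unit of distance
print(reg.rsquared)    # goodness of fit (R-squared)
df["predicted"] = reg.fittedvalues
df["residual"] = reg.resid
```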
Chapter 08: Complicated Patterns and Messy Data
The first part of this chapter covers how linear regression analysis can accommodate nonlinear patterns. We discuss transforming the dependent variable, the explanatory variable, or both, such as taking logs; piecewise linear splines; and quadratic and higher-order polynomials. We discuss whether and when to apply each technique, emphasize the correct interpretation of the coefficients of these regressions, and show how we may visualize their results.
The second half of the chapter discusses potential issues in regression analysis caused by influential observations and measurement error in variables. The chapter closes by discussing whether and how to use weights in regression analysis.
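A minimal sketch of two of these functional-form tools on simulated data: a log-log regression and a quadratic (a piecewise linear spline would work similarly with a basis built from knots):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(1, 10, 300)})
df["y"] = 5 * df["x"] ** 0.7 * np.exp(rng.normal(0, 0.2, 300))

# Log-log: the slope is interpreted as a percent difference in y per one percent difference in x
loglog = smf.ols("np.log(y) ~ np.log(x)", data=df).fit()

# Quadratic in x, keeping y in levels
quadratic = smf.ols("y ~ x + I(x ** 2)", data=df).fit()

print(loglog.params)
print(quadratic.params)
```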
Chapter 09: Generalizing Results of a Regression
This chapter discusses the methods of generalizing results of a linear regression from our data to the general pattern we care about. We start by describing the two steps of generalization in the context of regression analysis: statistical inference and external validity. Then we turn to statistical inference: quantifying uncertainty brought about by generalizing to the general pattern represented by our data. We discuss how to estimate the standard errors and confidence intervals of the estimates of the regression coefficients, how to estimate prediction intervals, and how to test hypotheses about regression coefficients. We introduce ways to visualize the confidence interval and the prediction interval together with the regression line, and we introduce the standard way to present the results of regression analysis in tables.
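A minimal sketch, on simulated data, of the inference steps listed above: robust standard errors, coefficient confidence intervals, a hypothesis test on the slope, and a prediction interval:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 3 + 2 * df["x"] + rng.normal(0, 2, 200)

reg = smf.ols("y ~ x", data=df).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors
print(reg.conf_int())                                 # 95% confidence intervals for intercept and slope
print(reg.pvalues["x"])                               # test of H0: slope = 0

new = pd.DataFrame({"x": [5.0]})
print(reg.get_prediction(new).summary_frame(alpha=0.05))   # point prediction and prediction interval
```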
Chapter 10: Multiple Linear Regression
This chapter introduces multiple regression. We start by discussing why and when we should estimate a multiple regression and how to interpret its coefficients. We then turn to how to construct and interpret confidence intervals of regression coefficients and test hypotheses about regression coefficients. We discuss the relationship between multiple regression and simple regression and derive the omitted variable bias. We explain that piecewise linear splines and polynomial regressions are technically multiple linear regressions without the same interpretation of the coefficients. We discuss how to include categorical explanatory variables as well as interactions that help uncover different slopes for groups. We include an informal discussion on how to decide what explanatory variables to include and in what functional form. Finally, we discuss why a typical multiple regression with cross-sectional observational data is not a ceteris paribus comparison, and that, as a result, it may get us closer to causal interpretation without fully uncovering it.
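A minimal sketch of a multiple regression on simulated data with a categorical explanatory variable and an interaction that allows different slopes by group (all names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "experience": rng.uniform(0, 30, n),
    "female":     rng.integers(0, 2, n),
    "industry":   rng.choice(["manufacturing", "services", "trade"], n),
})
df["wage"] = 10 + 0.3 * df["experience"] - 2 * df["female"] + rng.normal(0, 3, n)

# C() creates dummy variables for industry; the interaction allows a different experience slope by gender
reg = smf.ols("wage ~ experience * female + C(industry)", data=df).fit()
print(reg.summary())
print(reg.conf_int())   # 95% confidence intervals for each coefficient
```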
Chapter 11: Modeling Probabilities
This chapter introduces probability models that have a binary dependent variable. It starts with the linear probability model, and we discuss the interpretation of its coefficients. Linear probability models are usually fine for uncovering average associations, but they may be less well suited for prediction. The chapter introduces the two commonly used alternative models, the logit and the probit. Their coefficients are hard to interpret; we introduce marginal differences that are transformations of the coefficients and have interpretations similar to the coefficients of linear regressions. We argue that linear probability, logit, and probit models often produce very similar results in terms of the associations with explanatory variables, but they may lead to different predictions. We discuss and compare various measures of fit for probability models, such as the Brier score, and we introduce the concept of calibration. We end by explaining how data analysts can analyze more complicated y variables, such as ordinal qualitative variables or duration variables, by turning them into binary ones and estimating probability models.
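A minimal sketch, on simulated data, comparing a linear probability model, logit, and probit, with marginal differences computed for the latter two:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1000
df = pd.DataFrame({"x": rng.normal(0, 1, n)})
df["y"] = (0.8 * df["x"] + rng.logistic(0, 1, n) > 0).astype(int)

lpm    = smf.ols("y ~ x", data=df).fit()
logit  = smf.logit("y ~ x", data=df).fit(disp=0)
probit = smf.probit("y ~ x", data=df).fit(disp=0)

# Average marginal differences: interpretable like the LPM coefficient
print(lpm.params["x"])
print(logit.get_margeff().summary())
print(probit.get_margeff().summary())
```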
Chapter 12: Regression with Time Series Data
In this chapter we discuss the opportunities and challenges brought about by regression analysis of time series data and how to address those challenges. The chapter starts by discussing features of time series variables, such as trends, seasonality, random walk, and serial correlation. We explain why those features make regression analysis challenging and what we can do about them. In particular, we discuss when it’s a good idea to transform the y and x variables into differences, or relative differences. We introduce two methods to get appropriate standard error estimates in time series regressions: the Newey–West standard error estimator and including the lagged y variable on the right-hand side. We also discuss how we can estimate delayed associations by adding lags of x to a time series regression, and how we can directly estimate cumulative, or long-run, associations in such a regression.
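A minimal sketch of a time series regression in differences with a lagged x term and Newey–West (HAC) standard errors, on simulated monthly data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 240
df = pd.DataFrame({"dx": rng.normal(0, 1, n)})              # x already in differences
df["dy"] = 0.5 * df["dx"] + 0.2 * df["dx"].shift(1) + rng.normal(0, 1, n)
df["dx_lag1"] = df["dx"].shift(1)

# Newey–West standard errors with a 12-period lag window
reg = smf.ols("dy ~ dx + dx_lag1", data=df.dropna()).fit(
    cov_type="HAC", cov_kwds={"maxlags": 12})
print(reg.summary())
# The sum of the dx coefficients approximates the cumulative (long-run) association
```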
PART III: PREDICTION
Chapter 13: A Framework for Prediction
This chapter introduces a framework for prediction. We discuss the distinction between various types of prediction, such as quantitative predictions, probability predictions, and classification, and we focus on the first of these. We introduce point prediction versus interval prediction and we discuss the components of the prediction error. The main focus of this chapter is how to find the best prediction model, using observations in the original data, that will likely produce the best fit (smallest prediction error) in the live data. We introduce loss functions in general, and mean squared error (MSE) and its square root (RMSE) in particular, to evaluate predictions. We discuss three ways of finding the best predictor model: using all data and the Bayesian Information Criterion (BIC) as the measure of fit, using training–test splitting of the data, and using k-fold cross-validation, which is an improvement on the training–test split. We discuss how to assess and, if possible, improve the external validity of predictions. We close the chapter by discussing what machine learning means.
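A minimal sketch of comparing two candidate prediction models by 5-fold cross-validated RMSE, with simulated data standing in for the original data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(13)
X = rng.uniform(0, 10, (300, 1))
y = 2 + 0.5 * X[:, 0] + 0.1 * X[:, 0] ** 2 + rng.normal(0, 1, 300)

models = {
    "linear":    LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(name, -scores.mean())   # pick the model with the smallest cross-validated RMSE
```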
Chapter 14: Model Building for Prediction
This chapter discusses how to build regression models for prediction and how to evaluate the predictions they produce. With respect to model building, we discuss whether and when it’s a good idea to take logs of the y variable and what to do with such a prediction, as well as how to select variables out of a large pool of candidate x variables, how to decide on their functional forms, and whether to include their interactions. We introduce LASSO, an algorithm that can help with all that. With respect to evaluating predictions, we discuss why we need a holdout sample. We close this chapter with a discussion of the additional opportunities and challenges Big Data brings for predictive analytics.
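A minimal sketch of LASSO with the penalty chosen by cross-validation and a holdout sample kept aside for the final evaluation; the data are simulated and only a few of the 20 candidate x variables matter:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(21)
X = rng.normal(0, 1, (500, 20))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, 500)

# Keep a holdout sample that plays no role in model building
X_work, X_holdout, y_work, y_holdout = train_test_split(X, y, test_size=0.2, random_state=21)

lasso = LassoCV(cv=5).fit(X_work, y_work)
print(lasso.alpha_)                 # penalty parameter chosen by cross-validation
print((lasso.coef_ != 0).sum())     # number of variables LASSO kept
rmse = mean_squared_error(y_holdout, lasso.predict(X_holdout)) ** 0.5
print(rmse)                         # performance on the holdout sample
```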
Chapter 15: Regression Trees
This chapter introduces the regression tree, an alternative to linear regression for prediction purposes that can find the most important predictor variables and their interactions and can approximate any functional form automatically. Regression trees split the data into small bins (subsamples) by the value of the x variables. For a quantitative y, they use the average y value in those small sets to predict ŷ. We introduce the regression tree model and the most widely used algorithm to build a regression tree model. Somewhat confusingly, both the model and the algorithm are called CART (for classification and regression trees), but we reserve this name for the algorithm. We show that a regression tree is an intuitively appealing method to model nonlinearities and interactions among the x variables, but it is rarely used for prediction in itself because it is prone to overfit the original data. Instead, the regression tree forms the basic element of very powerful prediction methods that we’ll cover in the next chapter.
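A minimal sketch of a regression tree grown by the CART algorithm as implemented in scikit-learn, on simulated data with a nonlinear pattern (the depth and leaf-size limits are arbitrary choices to curb overfitting):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(15)
X = rng.uniform(0, 10, (400, 2))
y = np.where(X[:, 0] > 5, 20, 10) + 2 * X[:, 1] + rng.normal(0, 1, 400)

# Restricting depth and leaf size guards against overfitting the original data
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
# Each terminal node (bin) predicts with the average y of the observations that fall into it
```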
Chapter 16: Random Forest and Boosting
This chapter introduces two ensemble methods based on regression trees: the random forest and boosting. We start by introducing the main idea of ensemble methods: combining results from many imperfect models can lead to a much better prediction than a single model that we try to build to perfection. Of the two methods, we discuss the random forest (RF) in more detail. The random forest is perhaps the most frequently used method to predict a quantitative y variable, both because of its excellent predictive performance and because it is relatively simple to use. Ensemble methods are black box models, because their results do not help understand the underlying patterns of association between y and the x variables. We discuss some diagnostic tools that can help with that: variable importance plots, partial dependence plots, and examining the quality of predictions in subgroups. Finally, we briefly introduce the idea of boosting, an alternative approach to make predictions based on an ensemble of regression trees. There are various boosting methods, and they can produce even better predictions, but their use requires more expertise. We illustrate the power of boosting through the performance of the gradient boosting machine (GBM) method.
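A minimal sketch of a random forest for a quantitative y on simulated data, with a simple variable importance readout as one of the diagnostics mentioned above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(16)
X = pd.DataFrame(rng.normal(0, 1, (600, 4)), columns=["x1", "x2", "x3", "x4"])
y = 3 * X["x1"] + X["x1"] * X["x2"] + rng.normal(0, 1, 600)

rf = RandomForestRegressor(n_estimators=500, random_state=16).fit(X, y)

# Variable importance: which predictors the forest relies on most
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
# Partial dependence plots (sklearn.inspection.PartialDependenceDisplay) would be a natural next diagnostic
```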
Chapter 17: Probability Prediction and Classification
This chapter introduces the framework and methods of probability prediction and classification analysis for binary y variables. Probability prediction means predicting the probability that y = 1, with the help of the predictor variables. Classification means predicting the binary y variable itself, with the help of the predictor variables: putting each observation in one of the y categories, also called classes. We build on what we know about probability models and the basics of probability prediction from Chapter 11. In this chapter, we put that into the framework of predictive analytics to arrive at the best probability model for prediction purposes and to evaluate its performance. We then discuss how we can turn probability predictions into classification with the help of a classification threshold and how we should use a loss function to find the optimal threshold. We discuss how to evaluate a classification making use of a confusion table and expected loss. We introduce the ROC curve, which illustrates the trade-off of selecting different classification threshold values. We discuss how we can use random forest based on classification trees. Finally, we note the potential issues with the probability prediction and classification of rare events.
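A minimal sketch, on simulated data, of probability prediction turned into classification: predicted probabilities, a threshold, a confusion table, and the area under the ROC curve. The 0.5 threshold is only a placeholder; the chapter's point is that the threshold should come from a loss function:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(17)
X = rng.normal(0, 1, (1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)
prob = model.predict_proba(X)[:, 1]      # predicted probability that y = 1

threshold = 0.5                          # placeholder; should be derived from a loss function
pred_class = (prob >= threshold).astype(int)
print(confusion_matrix(y, pred_class))   # counts of true/false positives and negatives
print(roc_auc_score(y, prob))            # area under the ROC curve
```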
Chapter 18: Forecasting from Time Series Data
This chapter discusses forecasting: prediction from time series data for one or more time periods in the future. The focus of this chapter is forecasting future values of one variable, by making use of past values of the same variable, and possibly other variables, too. We build on what we learned about time series regressions in Chapter 12. We start with forecasts with a long horizon, which means many time periods into the future. Such forecasts use information on trends, seasonality, and other long-term features of the time series. We then turn to short-horizon forecasts that forecast y for a few time periods ahead. These forecasts make use of serial correlation of the time series of y besides those long-term features. We introduce autoregression (AR) and ARIMA models, which capture the patterns of serial correlation and can use it for short-horizon forecasting. We then turn to using other variables in forecasting, and introduce vector autoregression (VAR) models that help in forecasting future values of those x variables that we can use to forecast y. We discuss how to carry out cross-validation in forecasting and the specific challenges and opportunities the time series nature of our data provides for assessing external validity.
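A minimal sketch of a short-horizon forecast from an AR(1) model, written as an ARIMA specification in statsmodels, on a simulated monthly series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(18)
n = 120
e = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + e[t]          # a simulated AR(1) process
series = pd.Series(y, index=pd.date_range("2015-01-01", periods=n, freq="MS"))

model = ARIMA(series, order=(1, 0, 0)).fit()   # AR(1): one autoregressive lag, no differencing, no MA term
print(model.forecast(steps=6))                 # point forecasts six months ahead
print(model.get_forecast(steps=6).conf_int())  # forecast intervals
```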
PART IV: CAUSAL ANALYSIS
Chapter 19: A Framework for Causal Analysis
This chapter introduces a framework for causal analysis. The chapter starts by introducing the potential outcomes framework to define subjects, interventions, outcomes, and effects. We define the individual treatment effect, the average treatment effect, and the average treatment effect on the treated. We then show how these effects can be understood using the closely related definition of ceteris paribus comparisons. We close the conceptual part by introducing causal maps, which visualize data analysts’ assumptions about the relationships between several variables. We start our discussion of how to uncover an average treatment effect using actual data by focusing on the sources of variation in the causal variable, and we distinguish exogenous and endogenous sources. We define random assignment and show how it helps uncover the average effect. We then turn to issues with identifying effects in observational data. We define confounders, and we discuss that, in principle, we could identify average effects by conditioning on them. We then briefly discuss additional issues about variables we should not condition on, and the consequences of the typical mismatch between latent variables we think about and variables we can measure in real data. Finally, we discuss internal validity and external validity in causal analysis.
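A minimal simulated illustration of the framework: potential outcomes exist for every subject, the average treatment effect is their average difference, and random assignment lets a simple difference in means estimate it (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(19)
n = 10_000
y0 = rng.normal(50, 10, n)            # potential outcome without the intervention
y1 = y0 + rng.normal(5, 2, n)         # potential outcome with the intervention
ate = (y1 - y0).mean()                # average treatment effect (never observable in real data)

treated = rng.integers(0, 2, n).astype(bool)   # random assignment
y_observed = np.where(treated, y1, y0)         # we only ever see one potential outcome per subject

diff_in_means = y_observed[treated].mean() - y_observed[~treated].mean()
print(ate, diff_in_means)   # the two should be close, thanks to random assignment
```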
Chapter 20: Designing and Analyzing Experiments
This chapter discusses the most important questions about designing an experiment and analyzing data from an experiment to estimate the average effect of an intervention. The first part of the chapter focuses on design; the second part focuses on analysis. We start by discussing different kinds of controlled experiments, such as field experiments, A/B testing, and survey experiments. We discuss how to carry out random assignment in practice, why and how to check covariate balance, and how to actually estimate the effect and carry out statistical inference using the estimate. We introduce imperfect compliance and its consequences, as well as spillovers and other potential threats to internal validity. Among the more advanced topics, we introduce the local average treatment effect and power calculation, or sample size calculation, which tells us how many subjects we would need for our experiment. We conclude the chapter by discussing how we can think about the external validity of controlled experiments, and whether and how we can use data to help assess external validity.
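A minimal sketch of a sample size (power) calculation for a two-group experiment using statsmodels' power tools; the assumed effect size and outcome noise are placeholders:

```python
import statsmodels.stats.power as smp

expected_effect = 2.0     # assumed difference in mean outcomes between treatment and control
outcome_sd = 10.0         # assumed standard deviation of the outcome
effect_size = expected_effect / outcome_sd   # standardized effect size

analysis = smp.TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, power=0.8, alpha=0.05)
print(round(n_per_group))   # subjects needed per group for 80% power at a 5% significance level
```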
Chapter 21: Regression and Matching with Observational Data
In this chapter we discuss how to condition on potential confounder variables in practice, and how to interpret the results when our question is causal. We start with multiple linear regression, and we discuss how to select the variables to condition on and how to decide on their functional form. We then turn to matching, which is an intuitive alternative that turns out to be quite complicated to carry out in practice. We discuss exact matching and matching on the propensity score. Matching can detect a lack of common support (when some values of confounders appear only among treated or untreated observations). However, with common support, regression and matching, when applied according to good practice, tend to give similar results. We also give a very brief introduction to other methods: instrumental variables and regression discontinuity. These methods can give good effect estimates even when we don’t have all confounders in our data, but they can only be applied in specific circumstances. This chapter reviews methods that can be used for all kinds of observational data in principle but are used for cross-sectional data in practice, because data with a time series dimension, especially panel data, offers additional opportunities, which call for more specific methods.
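A minimal sketch, on simulated data with a single confounder, of the two approaches discussed here: regression that conditions on the confounder, and 1-to-1 matching on an estimated propensity score. Real applications need far more care than this toy example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(20)
n = 2000
df = pd.DataFrame({"confounder": rng.normal(0, 1, n)})
df["treated"] = (df["confounder"] + rng.normal(0, 1, n) > 0).astype(int)
df["outcome"] = 2 * df["treated"] + 3 * df["confounder"] + rng.normal(0, 1, n)

# (1) Multiple regression conditioning on the confounder
print(smf.ols("outcome ~ treated + confounder", data=df).fit().params["treated"])

# (2) Match each treated observation to the nearest untreated one on the propensity score
df["pscore"] = smf.logit("treated ~ confounder", data=df).fit(disp=0).predict(df)
treated, control = df[df.treated == 1], df[df.treated == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
att = (treated["outcome"].values - control.iloc[idx.ravel()]["outcome"].values).mean()
print(att)   # both estimates should be near the true effect of 2
```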
Chapter 22: Difference-in-Differences
This chapter introduces difference-in-differences analysis, or diff-in-diffs for short, and its use in understanding the effect of an intervention. We explain how to use xt panel data covering two time periods to carry out diff-in-diffs by comparing average changes from before an intervention to after it, and how to implement this in a simple regression. We discuss the parallel trends assumption that’s needed for the results to show average effects and how we can assess its validity by examining pre-intervention trends. Finally, we discuss some generalizations of the method to include observed confounder variables, to estimate the effect of a quantitative causal variable, or to use pooled cross-sectional data instead of an xt panel, with different subjects before and after the intervention.
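A minimal sketch of two-period diff-in-diffs implemented as a regression on simulated data: the coefficient on the interaction of the treated-group and after dummies is the diff-in-diffs estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(22)
n = 1000
df = pd.DataFrame({
    "treated_group": np.repeat([0, 1], n // 2),   # which subjects are treated
    "after":         np.tile([0, 1], n // 2),     # before vs. after the intervention
})
df["outcome"] = (10 + 2 * df["after"] + 1 * df["treated_group"]
                 + 3 * df["treated_group"] * df["after"] + rng.normal(0, 1, n))

dd = smf.ols("outcome ~ treated_group * after", data=df).fit()
print(dd.params["treated_group:after"])   # the diff-in-diffs estimate (close to 3 here)
```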
Chapter 23: Methods for Panel Data
This chapter introduces the most widely used regression methods to uncover the effect of an intervention when observational time series (tseries) data or cross-section time-series (xt) panel data is available with more than two time periods.
We discuss the potential advantages of having more time periods in allowing within-subject comparisons, assessing pre-intervention trends, and tracing out effects through time. We then review time series regressions, and we discuss what kind of average effect they can estimate, under what conditions, and how adding lags and leads can help uncover delayed effects and reverse effects. Then we discuss when we can pool several similar time series to get a more precise estimate of the effect. This is xt panel data with few cross-sectional units, and we discuss how to use such data to estimate an effect for a single cross-sectional unit.
In the second, and larger, part of the chapter, we turn to xt panel data with many cross-sectional units and more than two time periods, to estimate an average effect across the cross-sectional units. We introduce two methods: panel fixed-effects regressions (FE regressions) and panel regressions in first differences (FD regressions). Both can be viewed as generalizations of the diff-in-diffs method we covered in Chapter 22. We explain when each kind of regression may give a good estimate of the average effect, and we show how adding lags and leads can help uncover delayed effects, differences in pre-trends, and reverse effects. We discuss adding binary time variables to deal with aggregate trends of any form in FE or FD regressions, and how we can treat unit-specific linear trends in FD regressions. We discuss clustered standard error estimation that helps address serial correlation and heteroskedasticity at the same time. We briefly discuss how to analyze unbalanced panel data, and we close the chapter by comparing FE and FD regressions.
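A minimal sketch of a panel fixed-effects regression on simulated xt panel data, implementing the fixed effects with dummies and using standard errors clustered by cross-sectional unit (dedicated packages such as linearmodels' PanelOLS are a common alternative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(23)
units, periods = 100, 8
df = pd.DataFrame({
    "unit": np.repeat(np.arange(units), periods),
    "year": np.tile(np.arange(periods), units),
})
unit_effect = rng.normal(0, 2, units)[df["unit"]]          # time-invariant unit heterogeneity
df["x"] = unit_effect + rng.normal(0, 1, len(df))
df["y"] = 1.5 * df["x"] + unit_effect + rng.normal(0, 1, len(df))

# Unit and year dummies give the FE estimate; clustering handles serial correlation and heteroskedasticity
fe = smf.ols("y ~ x + C(unit) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(fe.params["x"])   # within-unit estimate of the coefficient on x
print(fe.bse["x"])      # clustered standard error
```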
Chapter 24: Appropriate Control Groups for Panel Data
This chapter discusses how data analysts can select a subset of the untreated observations in the data that are the best to learn about the counterfactual, and when that needs to be a conscious choice instead of using all available observations in the data. We introduce two methods. The first one is the synthetic control method, which creates a single counterfactual to an intervention that affects a single subject. We discuss how to select the donor pool of subjects that are similar to the treated subject and how the synthetic control algorithm uses pre-treatment variables to assign weights to each of them to create a single synthetic control subject. The second part of the chapter discusses the event study method, which helps trace the time path of the effect on many subjects that experience an intervention at different time points. Event studies are FD or FE panel regressions with a twist. Besides introducing the method, we discuss how we can choose an appropriate control group by defining pseudo-interventions and making sure they are similar to treated subjects in terms of average pre-treatment variables. We show how we can include them in event study regressions, how we can visualize the results of such regressions, and how to interpret their estimated coefficients.