Each chapter provides a summary, an outline, slides, and case study links.
Table of Contents
Downloads: Full contents (PDF), Index (PDF), Sample Chapters 10 & 14
Slides: For LaTeX versions, contact us.
PART I: DATA EXPLORATION
Chapter 01: Origins of Data
This chapter is about data collection and data quality. The chapter starts by introducing key concepts of data. It then describes the most important methods of data collection used in business, economics, and policy analysis, such as web scraping, using administrative sources, and conducting surveys. We introduce aspects of data quality, such as validity and reliability of variables and coverage of observations. We discuss how to assess and link data quality to how the data was collected. We devote a section to Big Data to understand what it is and how it may differ from more traditional data. This chapter also covers sampling, including random sampling and potential biases due to noncoverage and nonresponse, as well as ethical issues and some good practices in data collection.
chapter outline →
slides
CH01A
CH01B
CH01C
CH01B1
CH01B2
CH01B3
CH01C1
CH01C2
CH01C3
Chapter 02: Preparing Data for Analysis
This chapter is about preparing data for analysis: how to start working with data. First, we clarify some concepts: types of variables, types of observations, data tables, and datasets. We then turn to the concept of tidy data: data tables with the same kinds of observations. We discuss potential issues with observations and variables, and how to deal with those issues. We describe good practices for the process of data cleaning and discuss the additional challenges of working with Big Data.
slides
CH02A
CH02B
CH02C
Chapter 03: Exploratory Data Analysis
The chapter starts with why exploratory data analysis is important. It then discusses some basic concepts such as frequencies, probabilities, distributions, and extreme values. It includes guidelines for producing informative graphs and tables for presentation and describes the most important summary statistics. The chapter and its appendix also cover some of the most important theoretical distributions and their uses.
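As a rough illustration of the kind of summaries the chapter works with, here is a minimal Python sketch (synthetic data and variable names are ours for illustration, not one of the book's case studies; assumes numpy and pandas):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# synthetic price variable, for illustration only
df = pd.DataFrame({"price": rng.lognormal(mean=4, sigma=0.5, size=1000)})

# summary statistics: mean, median, spread, and extreme values
print(df["price"].describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]))

# the distribution as frequencies: counts of observations in 20 bins
counts, bin_edges = np.histogram(df["price"], bins=20)
print(counts)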
slides
CH03A
CH03B
CH03C
CH03D
CH03U1
Chapter 04: Comparison and Correlation
Most methods of data analysis are based on comparing values of one variable, y, across observations with different values of another variable, x, or more such variables. This chapter introduces simple methods of such comparison. We start by emphasizing that we need to define both y and x precisely for meaningful comparisons, and we need to measure them well. We introduce conditioning, and we discuss conditional comparisons, or further conditioning, which takes values of other variables into account as well. We discuss conditional probabilities, conditional distributions, and conditional means. We introduce the related concepts of dependence and mean-dependence, and we introduce covariance and correlation. Throughout the chapter, we discuss informative visualization of the various kinds of comparisons.
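A minimal Python sketch of conditional means and correlation (synthetic data with hypothetical variable names; assumes numpy and pandas):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# hypothetical example: hotel price (y) and distance to the city center (x)
df = pd.DataFrame({"distance": rng.uniform(0, 10, 500)})
df["price"] = 200 - 10 * df["distance"] + rng.normal(0, 20, 500)

# conditional means: average price within bins of distance
df["dist_bin"] = pd.cut(df["distance"], bins=5)
print(df.groupby("dist_bin", observed=True)["price"].mean())

# covariance and correlation between x and y
print(df[["distance", "price"]].cov())
print(df["distance"].corr(df["price"]))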
slides
CH04A
Chapter 05: Generalizing from Data
This chapter introduces the conceptual issues with generalizing results from our data to the general pattern we care about and methods of statistical inference. We start by discussing the two steps of the process of generalization: generalizing from the data to the general pattern our data represents, such as a population, and assessing how the general pattern that is relevant for the situation we care about relates to the general pattern our data represents. The first task is statistical inference; the second is assessing external validity. We introduce the conceptual framework of repeated samples and estimation. We introduce the standard error and the confidence interval that quantify the uncertainty of this step of generalization. We introduce two methods to estimate the standard error: the bootstrap and the standard error formula. Discussing external validity, we acknowledge that there are no readily available methods to quantify the uncertainty of this step of generalization, but we discuss how we can think about it and how we may use the results of additional data analysis to assess it.
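A minimal Python sketch of the two ways to estimate a standard error, the bootstrap and the SE formula (synthetic data for illustration; assumes numpy):

import numpy as np

rng = np.random.default_rng(7)
# hypothetical sample of 500 observations
sample = rng.normal(loc=100, scale=15, size=500)

# bootstrap: re-estimate the mean on many samples drawn with replacement
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(5000)
])
se_bootstrap = boot_means.std()

# the standard error formula for the mean
se_formula = sample.std() / np.sqrt(sample.size)

# approximate 95% confidence interval: estimate +/- 2 standard errors
ci_95 = (sample.mean() - 2 * se_bootstrap, sample.mean() + 2 * se_bootstrap)
print(se_bootstrap, se_formula, ci_95)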
slides
CH05A
Chapter 06: Testing Hypotheses
This chapter introduces the logic and practice of testing hypotheses. We describe the steps of hypothesis testing and discuss two alternative ways to carry it out: one with the help of a test statistic and a critical value, and another one with the help of a p-value. We discuss how decision rules are derived from our desire to control the likelihood of making erroneous decisions (false positives and false negatives), and how significance levels, power, and p-values are related to the likelihood of those errors. We focus on testing hypotheses about averages, but, as we show in one of our case studies, this focus is less restrictive than it may appear. The chapter covers one-sided versus two-sided alternatives, issues with testing multiple hypotheses, the perils of p-hacking, and some issues with testing on Big Data.
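A minimal Python sketch of the two equivalent decision rules, test statistic versus critical value and p-value versus significance level (synthetic data; assumes numpy and scipy):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# hypothetical sample; the null hypothesis is that the population mean is zero
x = rng.normal(loc=0.1, scale=1.0, size=400)

t_stat, p_value = stats.ttest_1samp(x, popmean=0)    # two-sided test
critical_value = stats.t.ppf(0.975, df=x.size - 1)   # 5% significance level

reject_by_t_stat = abs(t_stat) > critical_value      # decision rule 1
reject_by_p_value = p_value < 0.05                   # decision rule 2 (same answer)
print(t_stat, p_value, reject_by_t_stat, reject_by_p_value)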
slides
CH06A
CH06B
PART II: REGRESSION ANALYSIS
Chapter 07: Simple Regression
In this chapter, we introduce simple nonparametric regression and simple linear regression. We discuss nonparametric regressions such as bin scatters, step functions, and lowess regressions, and their visualization. The larger part of the chapter discusses simple linear regression in detail. We introduce the regression equation, how its coefficients are estimated in actual data by the method of ordinary least squares (OLS), and we emphasize how to interpret the coefficients. We introduce the concepts of predicted value, residual, and goodness of fit, and we discuss the relationship between regression and correlation. We end with a note on the relationship between causation and regression.
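A minimal Python sketch of a simple linear regression estimated by OLS (synthetic data with hypothetical variable names; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# hypothetical example: hotel price (y) regressed on distance to the center (x)
df = pd.DataFrame({"distance": rng.uniform(0, 10, 300)})
df["price"] = 180 - 12 * df["distance"] + rng.normal(0, 25, 300)

ols = smf.ols("price ~ distance", data=df).fit()   # OLS estimation
print(ols.params)       # intercept and slope coefficient
print(ols.rsquared)     # goodness of fit (R-squared)
df["predicted"] = ols.fittedvalues                 # predicted values
df["residual"] = ols.resid                         # residuals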
slides
CH07A
Chapter 08: Complicated Patterns and Messy Data
The first part of this chapter covers how linear regression analysis can accommodate nonlinear patterns. We discuss transforming the dependent variable, the explanatory variable, or both, such as taking logs; using piecewise linear splines; and including quadratic and higher-order polynomials. We discuss whether and when to apply each technique, emphasize the correct interpretation of the coefficients of these regressions, and show how we may visualize their results.
The second part of the chapter discusses potential issues with regression analysis: influential observations and measurement error in variables. The chapter closes by discussing whether and how to use weights in regression analysis.
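A minimal Python sketch of two of the functional-form tools from the first part of the chapter, a log–log regression and a quadratic (synthetic data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
# synthetic data with a nonlinear pattern between x and y
df = pd.DataFrame({"x": rng.uniform(1, 100, 400)})
df["y"] = np.exp(0.5 + 0.8 * np.log(df["x"]) + rng.normal(0, 0.2, 400))

# log-log regression: the slope is interpreted (approximately) as an elasticity
log_log = smf.ols("np.log(y) ~ np.log(x)", data=df).fit()

# quadratic in x as another way to capture the nonlinearity in levels
quadratic = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(log_log.params)
print(quadratic.params)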
slides
CH08A
CH08B
CH08C
Chapter 09: Generalizing Results of a Regression
This chapter discusses the methods of generalizing results of a linear regression from our data to the general pattern we care about. We start by describing the two steps of generalization in the context of regression analysis: statistical inference and external validity. Then we turn to statistical inference: quantifying uncertainty brought about by generalizing to the general pattern represented by our data. We discuss how to estimate the standard errors and confidence intervals of the estimates of the regression coefficients, how to estimate prediction intervals, and how to test hypotheses about regression coefficients. We introduce ways to visualize the confidence interval and the prediction interval together with the regression line, and we introduce the standard way to present the results of regression analysis in tables.
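A minimal Python sketch of confidence intervals, a coefficient test, and prediction intervals for a simple regression (synthetic data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 5 + 2 * df["x"] + rng.normal(0, 3, 200)

fit = smf.ols("y ~ x", data=df).fit()
print(fit.conf_int(alpha=0.05))   # 95% confidence intervals of the coefficients
print(fit.t_test("x = 0"))        # test a hypothesis about the slope coefficient

# confidence interval vs. prediction interval at two new x values
new = pd.DataFrame({"x": [2.0, 8.0]})
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])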
slides
CH09A
CH09B
Chapter 10: Multiple Linear Regression
This chapter introduces multiple regression. We start by discussing why and when we should estimate a multiple regression and how to interpret its coefficients. We then turn to how to construct and interpret confidence intervals of regression coefficients and test hypotheses about regression coefficients. We discuss the relationship between multiple regression and simple regression and derive the omitted variable bias. We explain that piecewise linear splines and polynomial regressions are technically multiple linear regressions, although their coefficients are not interpreted the same way. We discuss how to include categorical explanatory variables as well as interactions that help uncover different slopes for groups. We include an informal discussion on how to decide what explanatory variables to include and in what functional form. Finally, we discuss why a typical multiple regression with cross-sectional observational data is not a ceteris paribus comparison, and that, as a result, it may get us closer to a causal interpretation without fully uncovering it.
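A minimal Python sketch of a multiple regression with a categorical explanatory variable and an interaction (synthetic data with hypothetical variable names; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 500
# hypothetical earnings data
df = pd.DataFrame({
    "experience": rng.uniform(0, 30, n),
    "female": rng.integers(0, 2, n),
    "education": rng.choice(["low", "mid", "high"], n),
})
df["wage"] = 10 + 0.3 * df["experience"] - 2 * df["female"] + rng.normal(0, 3, n)

# C() adds category dummies; the interaction allows different slopes by group
fit = smf.ols("wage ~ experience * female + C(education)", data=df).fit()
print(fit.summary().tables[1])   # coefficients, standard errors, tests
print(fit.conf_int())            # confidence intervals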
slides
CH10A
CH10B
Chapter 11: Modeling Probabilities
This chapter introduces probability models that have a binary dependent variable. It starts with the linear probability model, and we discuss the interpretation of its coefficients. Linear probability models are usually fine to uncover average associations, but they may be less well suited for prediction. The chapter introduces the two commonly used alternative models, the logit and the probit. Their coefficients are hard to interpret; we introduce marginal differences that are transformations of the coefficients and have interpretations similar to the coefficients of linear regressions. We argue that linear probability, logit, and probit models often produce very similar results in terms of the associations with explanatory variables, but they may lead to different predictions. We discuss and compare various measures of fit for probability models, such as the Brier score, and we introduce the concept of calibration. We end by explaining how data analysts can analyze more complicated y variables, such as ordinal qualitative variables or duration variables, by turning them into binary ones and estimating probability models.
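A minimal Python sketch comparing a linear probability model and a logit, with average marginal differences and the Brier score (synthetic data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 1000
df = pd.DataFrame({"x": rng.normal(0, 1, n)})
# hypothetical binary outcome generated from a logistic model
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * df["x"]))))

lpm = smf.ols("y ~ x", data=df).fit()        # linear probability model
logit = smf.logit("y ~ x", data=df).fit()    # logit model
print(lpm.params)
print(logit.get_margeff().summary())         # average marginal differences

predicted_prob = logit.predict(df)
brier = np.mean((df["y"] - predicted_prob) ** 2)   # Brier score
print(brier)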
slides
CH11A
CH11B
Chapter 12: Regression with Time Series Data
In this chapter we discuss the opportunities and challenges brought about by regression analysis of time series data and how to address those challenges. The chapter starts by discussing features of time series variables, such as trends, seasonality, random walk, and serial correlation. We explain why those features make regression analysis challenging and what we can do about them. In particular, we discuss when it’s a good idea to transform the y and x variables into differences, or relative differences. We introduce two methods to get appropriate standard error estimates in time series regressions: the Newey–West standard error estimator and including the lagged y variable on the right-hand side. We also discuss how we can estimate delayed associations by adding lags of x to a time series regression, and how we can directly estimate cumulative, or long-run, associations in such a regression.
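A minimal Python sketch of a time series regression in differences with a lagged x and Newey–West standard errors (synthetic data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 200
# hypothetical monthly variables, already transformed into differences
df = pd.DataFrame({"dx": rng.normal(0, 1, n)})
df["dy"] = 0.5 * df["dx"] + rng.normal(0, 1, n)
df["dx_lag1"] = df["dx"].shift(1)   # lag of x to capture a delayed association

fit = smf.ols("dy ~ dx + dx_lag1", data=df.dropna()).fit(
    cov_type="HAC", cov_kwds={"maxlags": 12}   # Newey-West standard errors
)
print(fit.summary().tables[1])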
slides
CH12A
CH12B
PART III: PREDICTION
Chapter 13: A Framework for Prediction
This chapter introduces a framework for prediction. We discuss the distinction between various types of prediction, such as quantitative predictions, probability predictions, and classification, and we focus on the first of these. We introduce point prediction versus interval prediction, and we discuss the components of the prediction error. The main focus of this chapter is how to find, using observations in the original data, the prediction model that will likely produce the best fit (smallest prediction error) in the live data. We introduce loss functions in general, and mean squared error (MSE) and its square root (RMSE) in particular, to evaluate predictions. We discuss three ways of finding the best prediction model: using all data and the Bayesian Information Criterion (BIC) as the measure of fit, using training–test splitting of the data, and using k-fold cross-validation, which is an improvement on the training–test split. We discuss how to assess and, if possible, improve the external validity of predictions. We close the chapter by discussing what machine learning means.
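A minimal Python sketch of choosing between two candidate prediction models by 5-fold cross-validated RMSE (synthetic data; assumes pandas and scikit-learn):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
n = 500
X = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
y = 1 + 2 * X["x1"] + rng.normal(0, 1, n)   # x2 is an irrelevant candidate predictor

# compare candidate models by 5-fold cross-validated RMSE; pick the smallest
for cols in (["x1"], ["x1", "x2"]):
    mse = -cross_val_score(LinearRegression(), X[cols], y,
                           scoring="neg_mean_squared_error", cv=5)
    print(cols, np.sqrt(mse.mean()))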
slides
CH13A
Chapter 14: Model Building for Prediction
This chapter discusses how to build regression models for prediction and how to evaluate the predictions they produce. With respect to model building, we discuss whether and when it’s a good idea to take logs of the y variable and what to do with such a prediction, how to select variables out of a large pool of candidate x variables, and how to decide on their functional forms and whether to include their interactions. We introduce LASSO, an algorithm that can help with all of that. With respect to evaluating predictions, we discuss why we need a holdout sample. We close this chapter with a discussion of the additional opportunities and challenges Big Data brings for predictive analytics.
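A minimal Python sketch of LASSO with cross-validation and evaluation on a holdout sample (synthetic data; assumes numpy and scikit-learn):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)
n, k = 1000, 30
X = rng.normal(size=(n, k))                           # large pool of candidate x variables
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, n)   # only two of them matter

# set aside a holdout sample for the final evaluation of the prediction
X_work, X_holdout, y_work, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0)

# LASSO with its penalty parameter chosen by cross-validation on the work set
lasso = LassoCV(cv=5, random_state=0).fit(X_work, y_work)
print((lasso.coef_ != 0).sum())   # number of variables LASSO kept
holdout_rmse = mean_squared_error(y_holdout, lasso.predict(X_holdout)) ** 0.5
print(holdout_rmse)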
slides
CH14A
CH14B
Chapter 15: Regression Trees
This chapter introduces the regression tree, an alternative to linear regression for prediction purposes that can find the most important predictor variables and their interactions and can approximate any functional form automatically. Regression trees split the data into small bins (subsamples) by the values of the x variables. For a quantitative y, they use the average y value within each bin as the predicted value ŷ. We introduce the regression tree model and the most widely used algorithm to build a regression tree model. Somewhat confusingly, both the model and the algorithm are called CART (for classification and regression trees), but we reserve this name for the algorithm. We show that a regression tree is an intuitively appealing method to model nonlinearities and interactions among the x variables, but it is rarely used for prediction in itself because it is prone to overfit the original data. Instead, the regression tree forms the basic element of very powerful prediction methods that we’ll cover in the next chapter.
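A minimal Python sketch of a small regression tree (synthetic data; assumes numpy and scikit-learn, whose decision tree implements a CART-type algorithm):

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(11)
n = 1000
X = rng.uniform(0, 10, size=(n, 2))
# synthetic y with a nonlinearity and an interaction between the two x variables
y = np.where(X[:, 0] > 5, 10, 2) + X[:, 0] * (X[:, 1] > 3) + rng.normal(0, 1, n)

# grow a shallow tree: split the data into bins by x values and predict with the
# average y within each bin; limiting depth and leaf size guards against overfitting
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))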
slides
CH15A
Chapter 16: Random Forest and Boosting
This chapter introduces two ensemble methods based on regression trees: the random forest and boosting. We start by introducing the main idea of ensemble methods: combining results from many imperfect models can lead to a much better prediction than a single model that we try to build to perfection. Of the two methods, we discuss the random forest (RF) in more detail. The random forest is perhaps the most frequently used method to predict a quantitative y variable, both because of its excellent predictive performance and because it is relatively simple to use. Ensemble methods are black box models, because their results do not help us understand the underlying patterns of association between y and the x variables. We discuss some diagnostic tools that can help with that: variable importance plots, partial dependence plots, and examining the quality of predictions in subgroups. Finally, we briefly introduce the idea of boosting, an alternative approach to making predictions based on an ensemble of regression trees. There are various boosting methods, and they can produce even better predictions, but their use requires more expertise. We illustrate the power of boosting through the performance of the gradient boosting machine (GBM) method.
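A minimal Python sketch comparing a random forest and gradient boosting by cross-validated RMSE, with a simple variable importance measure (synthetic data; assumes numpy and scikit-learn):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(12)
n = 2000
X = rng.normal(size=(n, 5))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(0, 0.5, n)

models = {"random forest": RandomForestRegressor(n_estimators=300, random_state=0),
          "boosting (GBM)": GradientBoostingRegressor(random_state=0)}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5)
    print(name, np.sqrt(mse.mean()))   # cross-validated RMSE

rf = models["random forest"].fit(X, y)
print(rf.feature_importances_)         # one simple variable importance measure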
slides
CH16A
Chapter 17: Probability Prediction and Classification
This chapter introduces the framework and methods of probability prediction and classification analysis for binary y variables. Probability prediction means predicting the probability that y = 1, with the help of the predictor variables. Classification means predicting the binary y variable itself, with the help of the predictor variables: putting each observation in one of the y categories, also called classes. We build on what we know about probability models and the basics of probability prediction from Chapter 11. In this chapter, we put that into the framework of predictive analytics to arrive at the best probability model for prediction purposes and to evaluate its performance. We then discuss how we can turn probability predictions into classification with the help of a classification threshold and how we should use a loss function to find the optimal threshold. We discuss how to evaluate a classification making use of a confusion table and expected loss. We introduce the ROC curve, which illustrates the trade-off of selecting different classification threshold values. We discuss how we can use a random forest based on classification trees. Finally, we note the potential issues with the probability prediction and classification of rare events.
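A minimal Python sketch of probability prediction turned into classification with a threshold, evaluated with a confusion table and the area under the ROC curve (synthetic data; assumes numpy and scikit-learn):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(13)
n = 2000
X = rng.normal(size=(n, 3))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - X[:, 1]))))

model = LogisticRegression().fit(X, y)
predicted_prob = model.predict_proba(X)[:, 1]   # probability prediction

# classification: apply a threshold to the predicted probabilities;
# 0.5 is used purely for illustration, a loss function should pick it
threshold = 0.5
y_hat = (predicted_prob >= threshold).astype(int)
print(confusion_matrix(y, y_hat))               # confusion table
print(roc_auc_score(y, predicted_prob))         # area under the ROC curve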
slides
CH17A
Chapter 18: Forecasting from Time Series Data
This chapter discusses forecasting: prediction from time series data for one or more time periods in the future. The focus of this chapter is forecasting future values of one variable, by making use of past values of the same variable, and possibly other variables, too. We build on what we learned about time series regressions in Chapter 12. We start with forecasts with a long horizon, which means many time periods into the future. Such forecasts use information on trends, seasonality, and other long-term features of the time series. We then turn to short-horizon forecasts that forecast y for a few time periods ahead. These forecasts make use of serial correlation of the time series of y besides those long-term features. We introduce autoregression (AR) and ARIMA models, which capture the patterns of serial correlation and can use it for short-horizon forecasting. We then turn to using other variables in forecasting, and introduce vector autoregression (VAR) models that help in forecasting future values of those x variables that we can use to forecast y. We discuss how to carry out cross-validation in forecasting and the specific challenges and opportunities the time series nature of our data provides for assessing external validity.
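A minimal Python sketch of a short-horizon forecast from an AR(1) model, with forecast intervals (synthetic data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(14)
# synthetic monthly series with serial correlation (an AR(1) process)
n = 120
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal(0, 1)
series = pd.Series(y, index=pd.date_range("2015-01-01", periods=n, freq="MS"))

# AR(1) estimated as ARIMA(1,0,0); forecast six periods ahead
fit = ARIMA(series, order=(1, 0, 0)).fit()
forecast = fit.get_forecast(steps=6)
print(forecast.predicted_mean)   # point forecasts
print(forecast.conf_int())       # forecast intervals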
slides
CH18A
CH18B
PART IV: CAUSAL ANALYSIS
Chapter 19: A Framework for Causal Analysis
This chapter introduces a framework for causal analysis. The chapter starts by introducing the potential outcomes framework to define subjects, interventions, outcomes, and effect. We define the individual treatment effect, the average treatment effect, and the average treatment effect on the treated. We then show how these effects can be understood using the closely related definition of ceteris paribus comparisons. We close the conceptual part by introducing causal maps, which visualize data analysts’ assumptions about the relationships between several variables. We start our discussion of how to uncover an average treatment effect using actual data by focusing on the sources of variation in the causal variable, and we distinguish exogenous and endogenous sources. We define random assignment and show how it helps uncover the average effect. We then turn to issues with identifying effects in observational data. We define confounders, and we discuss how, in principle, we could identify average effects by conditioning on them. We then briefly discuss additional issues about variables we should not condition on, and the consequences of the typical mismatch between latent variables we think about and variables we can measure in real data. Finally, we discuss internal validity and external validity in causal analysis.
slides
CH19A
Chapter 20: Designing and Analyzing Experiments
This chapter discusses the most important questions about designing an experiment and analyzing data from an experiment to estimate the average effect of an intervention. The first part of the chapter focuses on design; the second part focuses on analysis. We start by discussing different kinds of controlled experiments, such as field experiments, A/B testing, and survey experiments. We discuss how to carry out random assignment in practice, why and how to check covariate balance, and how to actually estimate the effect and carry out statistical inference using the estimate. We introduce imperfect compliance and its consequences, as well as spillovers and other potential threats to internal validity. Among the more advanced topics, we introduce the local average treatment effect and power calculation, or sample size calculation, which determines the number of subjects that we would need for our experiment. We conclude the chapter by discussing how we can think about the external validity of controlled experiments, and whether and how we can use data to help assess external validity.
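A minimal Python sketch of random assignment, a covariate balance check, and estimating the average effect with a simple regression (synthetic data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(15)
n = 1000
df = pd.DataFrame({"age": rng.normal(40, 10, n)})   # a pre-determined covariate
# random assignment: exactly half of the subjects get the treatment
df["treated"] = rng.permutation(np.repeat([0, 1], n // 2))
df["outcome"] = 2 * df["treated"] + 0.1 * df["age"] + rng.normal(0, 5, n)

# covariate balance: pre-determined variables should be similar across groups
print(df.groupby("treated")["age"].mean())

# the average effect, its standard error, and its confidence interval
fit = smf.ols("outcome ~ treated", data=df).fit()
print(fit.params["treated"], fit.bse["treated"])
print(fit.conf_int().loc["treated"])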
slides
CH20A
CH20B
Chapter 21: Regression and Matching with Observational Data
In this chapter we discuss how to condition on potential confounder variables in practice, and how to interpret the results when our question is causal. We start with multiple linear regression, and we discuss how to select the variables to condition on and how to decide on their functional form. We then turn to matching, which is an intuitive alternative that turns out to be quite complicated to carry out in practice. We discuss exact matching and matching on the propensity score. Matching can detect a lack of common support (when some values of confounders appear only among treated or untreated observations). However, with common support, regression and matching, when applied according to good practice, tend to give similar results. We also give a very brief introduction to other methods: instrumental variables and regression discontinuity. These methods can give good effect estimates even when we don’t have all confounders in our data, but they can only be applied in specific circumstances. This chapter reviews methods that can in principle be used for all kinds of observational data but are used for cross-sectional data in practice, because data with a time series dimension, especially panel data, offers additional opportunities, which call for more specific methods.
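A minimal Python sketch of conditioning on a confounder in a regression and estimating a propensity score to check common support (synthetic data; matching itself is more involved and is not shown; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(16)
n = 2000
# synthetic observational data with a single confounder
df = pd.DataFrame({"confounder": rng.normal(0, 1, n)})
df["treated"] = rng.binomial(1, 1 / (1 + np.exp(-df["confounder"])))
df["outcome"] = 1.5 * df["treated"] + 2 * df["confounder"] + rng.normal(0, 1, n)

# conditioning on the confounder in a multiple regression
reg = smf.ols("outcome ~ treated + confounder", data=df).fit()
print(reg.params["treated"])

# propensity score from a logit; overlapping score ranges indicate common support
pscore = smf.logit("treated ~ confounder", data=df).fit(disp=0).predict(df)
print(pscore.groupby(df["treated"]).describe())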
slides
CH21A
Chapter 22: Difference-in-Differences
This chapter introduces difference-in-differences analysis, or diff-in-diffs for short, and its use in understanding the effect of an intervention. We explain how to use xt panel data covering two time periods to carry out diff-in-diffs by comparing average changes from before an intervention to after it, and how to implement this in a simple regression. We discuss the parallel trends assumption that’s needed for the results to show average effects and how we can assess its validity by examining pre-intervention trends. Finally, we discuss some generalizations of the method to include observed confounder variables, to estimate the effect of a quantitative causal variable, or to use pooled cross-sectional data instead of an xt panel, with different subjects before and after the intervention.
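A minimal Python sketch of diff-in-diffs implemented as a simple regression on a two-period xt panel, where the interaction coefficient is the estimated average effect (synthetic data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(17)
n = 500                            # subjects observed before and after
treated = rng.integers(0, 2, n)    # treatment group indicator
rows = []
for period in (0, 1):              # 0 = before, 1 = after the intervention
    y = (10 + 2 * period + 1.5 * treated
         + 3.0 * treated * period            # true effect, for illustration
         + rng.normal(0, 2, n))
    rows.append(pd.DataFrame({"id": np.arange(n), "after": period,
                              "treated": treated, "y": y}))
panel = pd.concat(rows)

# diff-in-diffs regression: the coefficient on treated:after is the estimate
did = smf.ols("y ~ treated * after", data=panel).fit()
print(did.params["treated:after"])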
slides
CH22A
Chapter 23: Methods for Panel Data
This chapter introduces the most widely used regression methods to uncover the effect of an intervention when observational time series (tseries) data or cross-section time-series (xt) panel data is available with more than two time periods. We discuss the potential advantages of having more time periods in allowing within-subject comparisons, assessing pre-intervention trends, and tracing out effects through time. We then review time series regressions, and we discuss what kind of average effect they can estimate, under what conditions, and how adding lags and leads can help uncover delayed effects and reverse effects. Then we discuss when we can pool several similar time series to get a more precise estimate of the effect. This is xt panel data with few cross-sectional units, and we discuss how to use such data to estimate an effect for a single cross-sectional unit.
In the second, and larger, part of the chapter, we turn to xt panel data with many cross-sectional units and more than two time periods, to estimate an average effect across the cross-sectional units. We introduce two methods: panel fixed-effects regressions (FE regressions) and panel regressions in first differences (FD regressions). Both can be viewed as generalizations of the diff-in-diffs method we covered in Chapter 22. We explain when each kind of regression may give a good estimate of the average effect, and we show how adding lags and leads can help uncover delayed effects, differences in pre-trends, and reverse effects. We discuss adding binary time variables to deal with aggregate trends of any form in FE or FD regressions, and how we can treat unit-specific linear trends in FD regressions. We discuss clustered standard error estimation that helps address serial correlation and heteroskedasticity at the same time. We briefly discuss how to analyze unbalanced panel data, and we close the chapter by comparing FE and FD regressions.
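A minimal Python sketch of an FE regression (via unit and time dummies) and an FD regression, both with standard errors clustered by cross-sectional unit (synthetic xt panel data; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(18)
n_units, n_periods = 100, 6
df = pd.DataFrame({"id": np.repeat(np.arange(n_units), n_periods),
                   "t": np.tile(np.arange(n_periods), n_units)})
unit_effect = np.repeat(rng.normal(0, 2, n_units), n_periods)  # unobserved heterogeneity
df["x"] = rng.normal(0, 1, len(df)) + unit_effect
df["y"] = 1.0 * df["x"] + unit_effect + rng.normal(0, 1, len(df))

# FE regression: unit dummies and time dummies, clustered standard errors
fe = smf.ols("y ~ x + C(id) + C(t)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["id"]})
print(fe.params["x"], fe.bse["x"])

# FD regression: first differences within units
df = df.sort_values(["id", "t"])
df["dy"] = df.groupby("id")["y"].diff()
df["dx"] = df.groupby("id")["x"].diff()
dfd = df.dropna()
fd = smf.ols("dy ~ dx", data=dfd).fit(
    cov_type="cluster", cov_kwds={"groups": dfd["id"]})
print(fd.params["dx"], fd.bse["dx"])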
slides
CH23A
CH23B
Chapter 24: Appropriate Control Groups for Panel Data
This chapter discusses how data analysts can select a subset of the untreated observations in the data that are the best to learn about the counterfactual, and when that needs to be a conscious choice instead of using all available observations in the data. We introduce two methods. The first one is the synthetic control method, which creates a single counterfactual to an intervention that affects a single subject. We discuss how to select the donor pool of subjects that are similar to the treated subject and how the synthetic control algorithm uses pre-treatment variables to assign weights to each of them to create a single synthetic control subject.
The second part of the chapter discusses the event study method, which helps trace the time path of the effect on many subjects that experience an intervention at different time points. Event studies are FD or FE panel regressions with a twist. Besides introducing the method, we discuss how we can choose an appropriate control group by defining pseudo-interventions and making sure they are similar to the treated subjects in terms of average pre-treatment variables. We show how we can include them in event study regressions, how we can visualize the results of such regressions, and how we can interpret their estimated coefficients.
slides
CH24A
CH24B