Downloads
Download the front matter with a detailed list of chapters and sections: CONTENTS
Download a sample chapter from the near-final proofs: Chapter 10 Multiple Linear Regression
Download a short summary of why to use this book: TEXTBOOK SUMMARY
PART I: DATA EXPLORATION
Chapter 01: Origins of Data
This chapter is about data collection and data quality. The chapter starts by introducing key concepts of data. It then describes the most important methods of data collection used in business, economics, and policy analysis, such as web scraping, using administrative sources, and conducting surveys. We introduce aspects of data quality, such as validity and reliability of variables and coverage of observations. We discuss how to assess data quality and how to link it to the way the data was collected. We devote a section to Big Data to understand what it is and how it may differ from more traditional data. This chapter also covers sampling, including random sampling and potential biases due to noncoverage and nonresponse, as well as ethical issues and some good practices in data collection.
Chapter 02: Preparing Data for Analysis
This chapter is about preparing data for analysis: how to start working with data. First, we clarify some concepts: types of variables, types of observations, data tables, and datasets. We then turn to the concept of tidy data: data tables with the same kinds of observations. We discuss potential issues with observations and variables, and how to deal with those issues. We describe good practices for the process of data cleaning and discuss the additional challenges of working with Big Data.
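As a minimal illustration of the tidy-data idea (a sketch with made-up numbers, not code from the book), here is how a wide table with one column per year can be reshaped in Python so that each row is one country-year observation:

```python
import pandas as pd

# A small wide table: one row per country, one GDP column per year (made-up numbers)
wide = pd.DataFrame({
    "country": ["Austria", "Belgium"],
    "gdp_2018": [386, 460],
    "gdp_2019": [397, 478],
})

# Reshape into a tidy long table: each row is one country-year observation
tidy = wide.melt(id_vars="country", var_name="year", value_name="gdp")
tidy["year"] = tidy["year"].str.replace("gdp_", "", regex=False).astype(int)
print(tidy)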
Chapter 03: Exploratory Data Analysis
The chapter starts with why exploratory data analysis is important. It then discusses some basic concepts such as frequencies, probabilities, distributions, and extreme values. It includes guidelines for producing informative graphs and tables for presentation and describes the most important summary statistics. The chapter and its appendix also cover some of the most important theoretical distributions and their uses.
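A quick sketch of the kind of summaries the chapter covers, with made-up numbers rather than the book's case study data:

```python
import pandas as pd

# A hypothetical price variable with one unusually large value
prices = pd.Series([95, 120, 80, 150, 110, 70, 400, 105, 98, 130], name="price")

print(prices.describe())                       # mean, standard deviation, quartiles, min, max
print(prices.quantile([0.05, 0.5, 0.95]))      # selected percentiles
print(prices[prices > prices.quantile(0.95)])  # candidate extreme values to inspect
# prices.hist() would draw a histogram of the distribution
```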
Chapter 04: Comparison and Correlation
Most methods of data analysis are based on comparing values of one variable, y, across observations with different values of another variable, x, or more such variables. This chapter introduces simple methods of such comparison. We start by emphasizing that we need to define both y and x precisely for meaningful comparisons, and we need to measure them well. We introduce conditioning, and we discuss conditional comparisons, or further conditioning, which takes values of other variables into account as well. We discuss conditional probabilities, conditional distributions, and conditional means. We introduce the related concepts of dependence and mean-dependence, and we introduce covariance and correlation. Throughout the chapter, we discuss informative visualization of the various kinds of comparisons.
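A minimal Python sketch of conditional means and correlation, with hypothetical hotel price and distance data (the variable names are placeholders, not the book's dataset):

```python
import pandas as pd

# Hypothetical data: hotel price (y), distance to the center and star rating (x variables)
hotels = pd.DataFrame({
    "price":    [95, 120, 80, 150, 110, 70],
    "distance": [1.2, 0.5, 2.5, 0.3, 1.0, 3.0],
    "stars":    [3, 4, 3, 4, 3, 2],
})

# Conditional mean: average price by star rating
print(hotels.groupby("stars")["price"].mean())

# Covariance and correlation between price and distance
print(hotels[["price", "distance"]].cov())
print(hotels["price"].corr(hotels["distance"]))
```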
Chapter 05: Generalizing from Data
This chapter introduces the conceptual issues with generalizing results from our data to the general pattern we care about and methods of statistical inference. We start by discussing the two steps of the process of generalization: generalizing from the data to the general pattern our data represents, such as a population, and assessing how the general pattern that is relevant for the situation we care about relates to the general pattern our data represents. The first task is statistical inference, the second is assessing external validity. We introduce the conceptual framework of repeated samples and estimation. We introduce the standard error and the confidence interval that quantify the uncertainty of this step of generalization. We introduce two methods to estimate the standard error, the bootstrap and the standard error formula. Discussing external validity, we acknowledge that there are no readily available methods to quantify the uncertainty of this step of generalization, but we discuss how we can think about it and how we may use the results of additional data analysis to assess it.
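A minimal sketch, on simulated data, of the two standard error estimates mentioned above and the confidence interval built from one of them:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=100, scale=15, size=500)   # a simulated sample

# Standard error formula for the mean: sd / sqrt(n)
se_formula = x.std(ddof=1) / np.sqrt(len(x))

# Bootstrap: re-estimate the mean on many samples drawn with replacement
boot_means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(1000)]
se_bootstrap = np.std(boot_means, ddof=1)

print(se_formula, se_bootstrap)   # the two estimates should be close

# 95% confidence interval using the formula standard error
print(x.mean() - 1.96 * se_formula, x.mean() + 1.96 * se_formula)
```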
Chapter 06: Testing Hypotheses
This chapter introduces the logic and practice of testing hypotheses. We describe the steps of hypothesis testing and discuss two alternative ways to carry it out: one with the help of a test statistic and a critical value, and another one with the help of a p-value. We discuss how decision rules are derived from our desire to control the likelihood of making erroneous decisions (false positives and false negatives), and how significance levels, power, and p-values are related to the likelihood of those errors. We focus on testing hypotheses about averages, but, as we show in one of our case studies, this focus is less restrictive than it may appear. The chapter covers one-sided versus two-sided alternatives, issues with testing multiple hypotheses, the perils of p-hacking, and some issues with testing on Big Data.
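A minimal sketch of both routes to a decision, on simulated data with a made-up null hypothesis (H0: the mean is 100):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=103, scale=15, size=200)   # simulated sample

result = stats.ttest_1samp(x, popmean=100)          # two-sided alternative
critical_value = stats.t.ppf(0.975, df=len(x) - 1)  # 5% significance level

print(f"t = {result.statistic:.2f}, critical value = {critical_value:.2f}, p = {result.pvalue:.3f}")
# Reject H0 if |t| > critical value, or equivalently if p < 0.05
```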
PART II: REGRESSION ANALYSIS
Chapter 07: Simple Regression
In this chapter, we introduce simple nonparametric regression and simple linear regression. We discuss nonparametric regressions such as bin scatters, step functions, and lowess regressions and their visualization. The larger part of the chapter discusses simple linear regression in detail. We introduce the regression equation, how its coefficients are estimated in actual data by the method of ordinary least squares (OLS), and we emphasize how to interpret the coefficients. We introduce the concepts of predicted value, residual, and goodness of fit, and we discuss the relationship between regression and correlation. We end with a note on the relationship between causation and regression.
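A minimal sketch of simple linear regression estimated by OLS on simulated data (statsmodels is one option; the variable names are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"distance": rng.uniform(0, 5, 200)})
df["price"] = 120 - 15 * df["distance"] + rng.normal(0, 10, 200)

reg = smf.ols("price ~ distance", data=df).fit()
print(reg.params)      # intercept and slope: the difference in average price per unit of distance
print(reg.rsquared)    # goodness of fit (R-squared)
df["predicted"] = reg.fittedvalues
df["residual"] = reg.resid
```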
Chapter 08: Complicated Patterns and Messy Data
The first part of this chapter covers how linear regression analysis can accommodate nonlinear patterns. We discuss transforming the dependent variable, the explanatory variable, or both, such as taking logs; piecewise linear splines; and quadratic and higher-order polynomials. We discuss whether and when to apply each technique, emphasize the correct interpretation of the coefficients of these regressions, and show how we may visualize their results.
The second half of the chapter discusses potential issues in regression analysis caused by influential observations and measurement error in variables. The chapter closes by discussing whether and how to use weights in regression analysis.
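A minimal sketch of two of these functional-form tools on simulated data: a log-log regression and a quadratic (a piecewise linear spline would work similarly with a basis built from knots):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x": rng.uniform(1, 10, 300)})
df["y"] = 5 * df["x"] ** 0.7 * np.exp(rng.normal(0, 0.2, 300))

# Log-log: the slope is interpreted as a percent difference in y per one percent difference in x
loglog = smf.ols("np.log(y) ~ np.log(x)", data=df).fit()

# Quadratic in x, keeping y in levels
quadratic = smf.ols("y ~ x + I(x ** 2)", data=df).fit()

print(loglog.params)
print(quadratic.params)
```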
Chapter 09: Generalizing Results of a Regression
This chapter discusses the methods of generalizing results of a linear regression from our data to the general pattern we care about. We start by describing the two steps of generalization in the context of regression analysis: statistical inference and external validity. Then we turn to statistical inference: quantifying uncertainty brought about by generalizing to the general pattern represented by our data. We discuss how to estimate the standard errors and confidence intervals of the estimates of the regression coefficients, how to estimate prediction intervals, and how to test hypotheses about regression coefficients. We introduce ways to visualize the confidence interval and the prediction interval together with the regression line, and we introduce the standard way to present the results of regression analysis in tables.
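A minimal sketch, on simulated data, of the inference steps listed above: robust standard errors, coefficient confidence intervals, a hypothesis test on the slope, and a prediction interval:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({"x": rng.uniform(0, 10, 200)})
df["y"] = 3 + 2 * df["x"] + rng.normal(0, 2, 200)

reg = smf.ols("y ~ x", data=df).fit(cov_type="HC1")   # heteroskedasticity-robust standard errors
print(reg.conf_int())                                 # 95% confidence intervals for intercept and slope
print(reg.pvalues["x"])                               # test of H0: slope = 0

new = pd.DataFrame({"x": [5.0]})
print(reg.get_prediction(new).summary_frame(alpha=0.05))   # point prediction and prediction interval
```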
Chapter 10: Multiple Linear Regression
This chapter introduces multiple regression. We start by discussing why and when we should estimate a multiple regression and how to interpret its coefficients. We then turn to how to construct and interpret confidence intervals of regression coefficients and test hypotheses about regression coefficients. We discuss the relationship between multiple regression and simple regression and derive the omitted variable bias. We explain that piecewise linear splines and polynomial regressions are technically multiple linear regressions without the same interpretation of the coefficients. We discuss how to include categorical explanatory variables as well as interactions that help uncover different slopes for groups. We include an informal discussion on how to decide what explanatory variables to include and in what functional form. Finally, we discuss why a typical multiple regression with cross-sectional observational data is not a ceteris paribus comparison, and that, as a result, it may get us closer to causal interpretation without fully uncovering it.
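A minimal sketch of a multiple regression on simulated data with a categorical explanatory variable and an interaction that allows different slopes by group (all names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "experience": rng.uniform(0, 30, n),
    "female":     rng.integers(0, 2, n),
    "industry":   rng.choice(["manufacturing", "services", "trade"], n),
})
df["wage"] = 10 + 0.3 * df["experience"] - 2 * df["female"] + rng.normal(0, 3, n)

# C() creates dummy variables for industry; the interaction allows a different experience slope by gender
reg = smf.ols("wage ~ experience * female + C(industry)", data=df).fit()
print(reg.summary())
print(reg.conf_int())   # 95% confidence intervals for each coefficient
```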
Chapter 11: Modeling Probabilities
This chapter introduces probability models that have a binary dependent variable. It starts with the linear probability model, and we discuss the interpretation of its coefficients. Linear probability models are usually fine for uncovering average associations, but they may be less well suited for prediction. The chapter introduces the two commonly used alternative models, the logit and the probit. Their coefficients are hard to interpret; we introduce marginal differences that are transformations of the coefficients and have interpretations similar to the coefficients of linear regressions. We argue that linear probability, logit, and probit models often produce very similar results in terms of the associations with explanatory variables, but they may lead to different predictions. We discuss and compare various measures of fit for probability models, such as the Brier score, and we introduce the concept of calibration. We end by explaining how data analysts can analyze more complicated y variables, such as ordinal qualitative variables or duration variables, by turning them into binary ones and estimating probability models.
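A minimal sketch, on simulated data, comparing a linear probability model, logit, and probit, with marginal differences computed for the latter two:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1000
df = pd.DataFrame({"x": rng.normal(0, 1, n)})
df["y"] = (0.8 * df["x"] + rng.logistic(0, 1, n) > 0).astype(int)

lpm    = smf.ols("y ~ x", data=df).fit()
logit  = smf.logit("y ~ x", data=df).fit(disp=0)
probit = smf.probit("y ~ x", data=df).fit(disp=0)

# Average marginal differences: interpretable like the LPM coefficient
print(lpm.params["x"])
print(logit.get_margeff().summary())
print(probit.get_margeff().summary())
```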
Chapter 12: Regression with Time Series Data
In this chapter we discuss the opportunities and challenges brought about by regression analysis of time series data and how to address those challenges. The chapter starts by discussing features of time series variables, such as trends, seasonality, random walk, and serial correlation. We explain why those features make regression analysis challenging and what we can do about them. In particular, we discuss when it’s a good idea to transform the y and x variables into differences, or relative differences. We introduce two methods to get appropriate standard error estimates in time series regressions: the Newey–West standard error estimator and including the lagged y variable on the right-hand side. We also discuss how we can estimate delayed associations by adding lags of x to a time series regression, and how we can directly estimate cumulative, or long-run, associations in such a regression.
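A minimal sketch of a time series regression in differences with a lagged x term and Newey–West (HAC) standard errors, on simulated monthly data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 240
df = pd.DataFrame({"dx": rng.normal(0, 1, n)})              # x already in differences
df["dy"] = 0.5 * df["dx"] + 0.2 * df["dx"].shift(1) + rng.normal(0, 1, n)
df["dx_lag1"] = df["dx"].shift(1)

# Newey–West standard errors with a 12-period lag window
reg = smf.ols("dy ~ dx + dx_lag1", data=df.dropna()).fit(
    cov_type="HAC", cov_kwds={"maxlags": 12})
print(reg.summary())
# The sum of the dx coefficients approximates the cumulative (long-run) association
```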
PART III: PREDICTION
Chapter 13: A Framework for Prediction
This chapter introduces a framework for prediction. We discuss the distinction between various types of prediction, such as quantitative predictions, probability predictions, and classification, and we focus on the first of these. We introduce point prediction versus interval prediction and we discuss the components of the prediction error. The main focus of this chapter is how to find the best prediction model, using observations in the original data, that will likely produce the best fit (smallest prediction error) in the live data. We introduce loss functions in general, and mean squared error (MSE) and its square root (RMSE) in particular, to evaluate predictions. We discuss three ways of finding the best predictor model: using all data and the Bayesian Information Criterion (BIC) as the measure of fit, using training–test splitting of the data, and using k-fold cross-validation, which is an improvement on the training–test split. We discuss how to assess and, if possible, improve the external validity of predictions. We close the chapter by discussing what machine learning means.
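A minimal sketch of comparing two candidate prediction models by 5-fold cross-validated RMSE, with simulated data standing in for the original data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(13)
X = rng.uniform(0, 10, (300, 1))
y = 2 + 0.5 * X[:, 0] + 0.1 * X[:, 0] ** 2 + rng.normal(0, 1, 300)

models = {
    "linear":    LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(name, -scores.mean())   # pick the model with the smallest cross-validated RMSE
```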
Chapter 14: Model Building for Prediction
This chapter discusses how to build regression models for prediction and how to evaluate the predictions they produce. With respect to model building, we discuss whether and when it’s a good idea to take logs of the y variable and what to do with such a prediction, as well as how to select variables out of a large pool of candidate x variables, how to decide on their functional forms, and whether to include their interactions. We introduce LASSO, an algorithm that can help with all that. With respect to evaluating predictions, we discuss why we need a holdout sample. We close this chapter with a discussion of the additional opportunities and challenges Big Data brings for predictive analytics.
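A minimal sketch of LASSO with the penalty chosen by cross-validation and a holdout sample kept aside for the final evaluation; the data are simulated and only a few of the 20 candidate x variables matter:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(21)
X = rng.normal(0, 1, (500, 20))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1, 500)

# Keep a holdout sample that plays no role in model building
X_work, X_holdout, y_work, y_holdout = train_test_split(X, y, test_size=0.2, random_state=21)

lasso = LassoCV(cv=5).fit(X_work, y_work)
print(lasso.alpha_)                 # penalty parameter chosen by cross-validation
print((lasso.coef_ != 0).sum())     # number of variables LASSO kept
rmse = mean_squared_error(y_holdout, lasso.predict(X_holdout)) ** 0.5
print(rmse)                         # performance on the holdout sample
```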
Chapter 15: Regression Trees
This chapter introduces the regression tree, an alternative to linear regression for prediction purposes that can find the most important predictor variables and their interactions and can approximate any functional form automatically. Regression trees split the data into small bins (subsamples) by the value of the x variables. For a quantitative y, they use the average y value in those small sets to predict ŷ. We introduce the regression tree model and the most widely used algorithm to build a regression tree model. Somewhat confusingly, both the model and the algorithm are called CART (for classification and regression trees), but we reserve this name for the algorithm. We show that a regression tree is an intuitively appealing method to model nonlinearities and interactions among the x variables, but it is rarely used for prediction in itself because it is prone to overfit the original data. Instead, the regression tree forms the basic element of very powerful prediction methods that we’ll cover in the next chapter.
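A minimal sketch of a regression tree grown by the CART algorithm as implemented in scikit-learn, on simulated data with a nonlinear pattern (the depth and leaf-size limits are arbitrary choices to curb overfitting):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(15)
X = rng.uniform(0, 10, (400, 2))
y = np.where(X[:, 0] > 5, 20, 10) + 2 * X[:, 1] + rng.normal(0, 1, 400)

# Restricting depth and leaf size guards against overfitting the original data
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))
# Each terminal node (bin) predicts with the average y of the observations that fall into it
```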
Chapter 16: Random Forest and Boosting
This chapter introduces two ensemble methods based on regression trees: the random forest and boosting. We start by introducing the main idea of ensemble methods: combining results from many imperfect models can lead to a much better prediction than a single model that we try to build to perfection. Of the two methods, we discuss the random forest (RF) in more detail. The random forest is perhaps the most frequently used method to predict a quantitative y variable, both because of its excellent predictive performance and because it is relatively simple to use. Ensemble methods are black box models, because their results do not help understand the underlying patterns of association between y and the x variables. We discuss some diagnostic tools that can help with that: variable importance plots, partial dependence plots, and examining the quality of predictions in subgroups. Finally, we briefly introduce the idea of boosting, an alternative approach to make predictions based on an ensemble of regression trees. There are various boosting methods, and they can produce even better predictions, but their use requires more expertise. We illustrate the power of boosting through the performance of the gradient boosting machine (GBM) method.
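A minimal sketch of a random forest for a quantitative y on simulated data, with a simple variable importance readout as one of the diagnostics mentioned above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(16)
X = pd.DataFrame(rng.normal(0, 1, (600, 4)), columns=["x1", "x2", "x3", "x4"])
y = 3 * X["x1"] + X["x1"] * X["x2"] + rng.normal(0, 1, 600)

rf = RandomForestRegressor(n_estimators=500, random_state=16).fit(X, y)

# Variable importance: which predictors the forest relies on most
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))
# Partial dependence plots (sklearn.inspection.PartialDependenceDisplay) would be a natural next diagnostic
```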
Chapter 17: Probability Prediction and Classification
This chapter introduces the framework and methods of probability prediction and classification analysis for binary y variables. Probability prediction means predicting the probability that y = 1, with the help of the predictor variables. Classification means predicting the binary y variable itself, with the help of the predictor variables: putting each observation in one of the y categories, also called classes. We build on what we know about probability models and the basics of probability prediction from Chapter 11. In this chapter, we put that into the framework of predictive analytics to arrive at the best probability model for prediction purposes and to evaluate its performance. We then discuss how we can turn probability predictions into classification with the help of a classification threshold and how we should use a loss function to find the optimal threshold. We discuss how to evaluate a classification making use of a confusion table and expected loss. We introduce the ROC curve, which illustrates the trade-off of selecting different classification threshold values. We discuss how we can use random forest based on classification trees. Finally, we note the potential issues with the probability prediction and classification of rare events.
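A minimal sketch, on simulated data, of probability prediction turned into classification: predicted probabilities, a threshold, a confusion table, and the area under the ROC curve. The 0.5 threshold is only a placeholder; the chapter's point is that the threshold should come from a loss function:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(17)
X = rng.normal(0, 1, (1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)
prob = model.predict_proba(X)[:, 1]      # predicted probability that y = 1

threshold = 0.5                          # placeholder; should be derived from a loss function
pred_class = (prob >= threshold).astype(int)
print(confusion_matrix(y, pred_class))   # counts of true/false positives and negatives
print(roc_auc_score(y, prob))            # area under the ROC curve
```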
Chapter 18: Forecasting from Time Series Data
This chapter discusses forecasting: prediction from time series data for one or more time periods in the future. The focus of this chapter is forecasting future values of one variable, by making use of past values of the same variable, and possibly other variables, too. We build on what we learned about time series regressions in Chapter 12. We start with forecasts with a long horizon, which means many time periods into the future. Such forecasts use information on trends, seasonality, and other long-term features of the time series. We then turn to short-horizon forecasts that forecast y for a few time periods ahead. These forecasts make use of serial correlation of the time series of y besides those long-term features. We introduce autoregression (AR) and ARIMA models, which capture the patterns of serial correlation and can use it for short-horizon forecasting. We then turn to using other variables in forecasting, and introduce vector autoregression (VAR) models that help in forecasting future values of those x variables that we can use to forecast y. We discuss how to carry out cross-validation in forecasting and the specific challenges and opportunities the time series nature of our data provides for assessing external validity.
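A minimal sketch of a short-horizon forecast from an AR(1) model, written as an ARIMA specification in statsmodels, on a simulated monthly series:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(18)
n = 120
e = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + e[t]          # a simulated AR(1) process
series = pd.Series(y, index=pd.date_range("2015-01-01", periods=n, freq="MS"))

model = ARIMA(series, order=(1, 0, 0)).fit()   # AR(1): one autoregressive lag, no differencing, no MA term
print(model.forecast(steps=6))                 # point forecasts six months ahead
print(model.get_forecast(steps=6).conf_int())  # forecast intervals
```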
PART IV: CAUSAL ANALYSIS
Chapter 19: A Framework for Causal Analysis
This chapter introduces a framework for causal analysis. The chapter starts by introducing the potential outcomes framework to define subjects, interventions, outcomes, and effects. We define the individual treatment effect, the average treatment effect, and the average treatment effect on the treated. We then show how these effects can be understood using the closely related definition of ceteris paribus comparisons. We close the conceptual part by introducing causal maps, which visualize data analysts’ assumptions about the relationships between several variables. We start our discussion of how to uncover an average treatment effect using actual data by focusing on the sources of variation in the causal variable, and we distinguish exogenous and endogenous sources. We define random assignment and show how it helps uncover the average effect. We then turn to issues with identifying effects in observational data. We define confounders, and we discuss that, in principle, we could identify average effects by conditioning on them. We then briefly discuss additional issues about variables we should not condition on, and the consequences of the typical mismatch between latent variables we think about and variables we can measure in real data. Finally, we discuss internal validity and external validity in causal analysis.
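A minimal simulated illustration of the framework: potential outcomes exist for every subject, the average treatment effect is their average difference, and random assignment lets a simple difference in means estimate it (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(19)
n = 10_000
y0 = rng.normal(50, 10, n)            # potential outcome without the intervention
y1 = y0 + rng.normal(5, 2, n)         # potential outcome with the intervention
ate = (y1 - y0).mean()                # average treatment effect (never observable in real data)

treated = rng.integers(0, 2, n).astype(bool)   # random assignment
y_observed = np.where(treated, y1, y0)         # we only ever see one potential outcome per subject

diff_in_means = y_observed[treated].mean() - y_observed[~treated].mean()
print(ate, diff_in_means)   # the two should be close, thanks to random assignment
```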
Chapter 20: Designing and Analyzing Experiments
This chapter discusses the most important questions about designing an experiment and analyzing data from an experiment to estimate the average effect of an intervention. The first part of the chapter focuses on design; the second part focuses on analysis. We start by discussing different kinds of controlled experiments, such as field experiments, A/B testing, and survey experiments. We discuss how to carry out random assignment in practice, why and how to check covariate balance, and how to actually estimate the effect and carry out statistical inference using the estimate. We introduce imperfect compliance and its consequences, as well as spillovers and other potential threats to internal validity. Among the more advanced topics, we introduce the local average treatment effect and power calculation, or sample size calculation, which tells us how many subjects we would need for our experiment. We conclude the chapter by discussing how we can think about the external validity of controlled experiments, and whether and how we can use data to help assess external validity.
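A minimal sketch of a sample size (power) calculation for a two-group experiment using statsmodels' power tools; the assumed effect size and outcome noise are placeholders:

```python
import statsmodels.stats.power as smp

expected_effect = 2.0     # assumed difference in mean outcomes between treatment and control
outcome_sd = 10.0         # assumed standard deviation of the outcome
effect_size = expected_effect / outcome_sd   # standardized effect size

analysis = smp.TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, power=0.8, alpha=0.05)
print(round(n_per_group))   # subjects needed per group for 80% power at a 5% significance level
```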
Chapter 21: Regression and Matching with Observational Data
In this chapter we discuss how to condition on potential confounder variables in practice, and how to interpret the results when our question is causal. We start with multiple linear regression, and we discuss how to select the variables to condition on and how to decide on their functional form. We then turn to matching, which is an intuitive alternative that turns out to be quite complicated to carry out in practice. We discuss exact matching and matching on the propensity score. Matching can detect a lack of common support (when some values of confounders appear only among treated or untreated observations). However, with common support, regression and matching, when applied according to good practice, tend to give similar results. We also give a very brief introduction to other methods: instrumental variables and regression discontinuity. These methods can give good effect estimates even when we don’t have all confounders in our data, but they can only be applied in specific circumstances. This chapter reviews methods that can be used for all kinds of observational data in principle but are used for cross-sectional data in practice, because data with a time series dimension, especially panel data, offers additional opportunities, which call for more specific methods.
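A minimal sketch, on simulated data with a single confounder, of the two approaches discussed here: regression that conditions on the confounder, and 1-to-1 matching on an estimated propensity score. Real applications need far more care than this toy example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(20)
n = 2000
df = pd.DataFrame({"confounder": rng.normal(0, 1, n)})
df["treated"] = (df["confounder"] + rng.normal(0, 1, n) > 0).astype(int)
df["outcome"] = 2 * df["treated"] + 3 * df["confounder"] + rng.normal(0, 1, n)

# (1) Multiple regression conditioning on the confounder
print(smf.ols("outcome ~ treated + confounder", data=df).fit().params["treated"])

# (2) Match each treated observation to the nearest untreated one on the propensity score
df["pscore"] = smf.logit("treated ~ confounder", data=df).fit(disp=0).predict(df)
treated, control = df[df.treated == 1], df[df.treated == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
att = (treated["outcome"].values - control.iloc[idx.ravel()]["outcome"].values).mean()
print(att)   # both estimates should be near the true effect of 2
```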
Chapter 22: Difference-in-Differences
This chapter introduces difference-in-differences analysis, or diff-in-diffs for short, and its use in understanding the effect of an intervention. We explain how to use xt panel data covering two time periods to carry out diff-in-diffs by comparing average changes from before an intervention to after it, and how to implement this in a simple regression. We discuss the parallel trends assumption that’s needed for the results to show average effects and how we can assess its validity by examining pre-intervention trends. Finally, we discuss some generalizations of the method to include observed confounder variables, to estimate the effect of a quantitative causal variable, or to use pooled cross-sectional data instead of an xt panel, with different subjects before and after the intervention.
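A minimal sketch of two-period diff-in-diffs implemented as a regression on simulated data: the coefficient on the interaction of the treated-group and after dummies is the diff-in-diffs estimate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(22)
n = 1000
df = pd.DataFrame({
    "treated_group": np.repeat([0, 1], n // 2),   # which subjects are treated
    "after":         np.tile([0, 1], n // 2),     # before vs. after the intervention
})
df["outcome"] = (10 + 2 * df["after"] + 1 * df["treated_group"]
                 + 3 * df["treated_group"] * df["after"] + rng.normal(0, 1, n))

dd = smf.ols("outcome ~ treated_group * after", data=df).fit()
print(dd.params["treated_group:after"])   # the diff-in-diffs estimate (close to 3 here)
```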
Chapter 23: Methods for Panel Data
This chapter introduces the most widely used regression methods to uncover the effect of an intervention when observational time series (tseries) data or cross-section time-series (xt) panel data is available with more than two time periods.
We discuss the potential advantages of having more time periods in allowing within-subject comparisons, assessing pre-intervention trends, and tracing out effects through time. We then review time series regressions, and we discuss what kind of average effect they can estimate, under what conditions, and how adding lags and leads can help uncover delayed effects and reverse effects. Then we discuss when we can pool several similar time series to get a more precise estimate of the effect. This is xt panel data with few cross-sectional units, and we discuss how to use such data to estimate an effect for a single cross-sectional unit.
In the second, and larger, part of the chapter, we turn to xt panel data with many cross-sectional units and more than two time periods, to estimate an average effect across the cross-sectional units. We introduce two methods: panel fixed-effects regressions (FE regressions) and panel regressions in first differences (FD regressions). Both can be viewed as generalizations of the diff-in-diffs method we covered in Chapter 22. We explain when each kind of regression may give a good estimate of the average effect, and we show how adding lags and leads can help uncover delayed effects, differences in pre-trends, and reverse effects. We discuss adding binary time variables to deal with aggregate trends of any form in FE or FD regressions, and how we can treat unit-specific linear trends in FD regressions. We discuss clustered standard error estimation that helps address serial correlation and heteroskedasticity at the same time. We briefly discuss how to analyze unbalanced panel data, and we close the chapter by comparing FE and FD regressions.
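A minimal sketch of a panel fixed-effects regression on simulated xt panel data, implementing the fixed effects with dummies and using standard errors clustered by cross-sectional unit (dedicated packages such as linearmodels' PanelOLS are a common alternative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(23)
units, periods = 100, 8
df = pd.DataFrame({
    "unit": np.repeat(np.arange(units), periods),
    "year": np.tile(np.arange(periods), units),
})
unit_effect = rng.normal(0, 2, units)[df["unit"]]          # time-invariant unit heterogeneity
df["x"] = unit_effect + rng.normal(0, 1, len(df))
df["y"] = 1.5 * df["x"] + unit_effect + rng.normal(0, 1, len(df))

# Unit and year dummies give the FE estimate; clustering handles serial correlation and heteroskedasticity
fe = smf.ols("y ~ x + C(unit) + C(year)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["unit"]})
print(fe.params["x"])   # within-unit estimate of the coefficient on x
print(fe.bse["x"])      # clustered standard error
```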
Chapter 24: Appropriate Control Groups for Panel Data
This chapter discusses how data analysts can select a subset of the untreated observations in the data that are the best to learn about the counterfactual, and when that needs to be a conscious choice instead of using all available observations in the data. We introduce two methods. The first one is the synthetic control method, which creates a single counterfactual to an intervention that affects a single subject. We discuss how to select the donor pool of subjects that are similar to the treated subject and how the synthetic control algorithm uses pre-treatment variables to assign weights to each of them to create a single synthetic control subject. The second part of the chapter discusses the event study method, which helps trace the time path of the effect on many subjects that experience an intervention at different time points. Event studies are FD or FE panel regressions with a twist. Besides introducing the method, we discuss how we can choose an appropriate control group by defining pseudo-interventions and making sure they are similar to treated subjects in terms of average pre-treatment variables. We show how we can include them in event study regressions, how we can visualize the results of such regressions, and how to interpret their estimated coefficients.