Topics

Part I Data Exploration

Part I Data Exploration covers the most important tools and methods data analysts need to start working with data from scratch. It covers organizing, preparing, describing and documenting data as well as the statistical foundations, including summary statistics, distributions, statistical inference, external validity, and the basics of hypothesis testing. It discusses how the big data revolution affects all aspects of data exploration, from the need for new data management tools through to the diminished importance of statistical inference relative to external validity.

Part II Regression Analysis

Part II Regression Analysis introduces the workhorse of empirical research: linear regression. It starts with simple nonparametric and linear regression, discusses how linear regression can approximate nonlinear patterns of association and how to generalize the results of regression analysis beyond the data, and introduces multiple linear regression. The last two chapters introduce probability models and the basics of regression analysis with time series data.

Part III Prediction

Part III Prediction covers the most important methods of predictive data analysis. It outlines a framework for prediction, emphasizing the goal of using your data to make predictions outside the framework. It discusses the process of cross-validation with a loss function, model building, LASSO, and the role of a holdout sample to carry out diagnostics. It discusses prediction with the help of linear regression, gives a thorough introduction to regression trees and random forests, and touches on boosting. The last two chapters discuss probability predictions and classification, and forecasting from time series data.

Part IV Causal Analysis

Part IV Causal Analysis covers the most important elements of analyzing the effect of interventions. It starts by outlining a framework for causal analysis based on potential outcomes, defining the average treatment effect and distinguishing exogenous from endogenous sources of variation in the causal variable. It includes a chapter on designing and analyzing experiments, and one on conditioning on observable endogenous sources of variation in observational data by regression or matching. The second half of Part IV discusses uncovering effects with the help of multiple observations per subject. It covers simple difference-in-differences analysis, simple time series regressions, and more general methods using panel data, such as regressions in first differences or with fixed effects, synthetic controls, and event studies.