# Case Studies

Data analysis is a process

# Data and code info

**To get code, data, graphs**, we advise to follow instructions on how to get all codes and dependencies for **Stata** or **R** or **Python**. All the data may be downloaded from Get Data

**Update: links will be live by end of September**

PART 1: DATA EXPLORATION

PART 2: REGRESSION ANALYSIS

PART 3: PREDICTION

PART 4: CAUSAL ANALYSIS

# PART I: DATA EXPLORATION

## CH01A Finding a good deal among hotels: data collection

Vienna, Austria is a popular tourist destination for business and leisure. From the hundreds of places that offer accommodation, we want to pick a hotel that is underpriced relative to its location and quality for a weekday in November 2017. Can we use data to help this decision? What kind of data would we need, and how could we get it?

This case study illustrates how to collect appropriate data from the web on multiple offers. It describes what we want from such data and what data source we would need. The data is collected by **web scraping**, and it results in a single **data table.** The case study discusses the **data quality** from the perspective of the question to answer and how data quality is determined by the way the data was born. There is no dataset to analyze in this case study in this chapter. Subsequent case studies (2A, 3A, 7A, 8A, 9B, 10B) will use the data desctibed here to illustrate steps of data analysis that lead to ultimately answering the main question.

## CH01B Comparing online and offline prices: data collection

Do online and offline prices of the same products tend to be the same? To answer that question, we need data on both the online and offline (in store) price of many products. Such data was collected as part of the Billion Prices Project (BPP; http://www.thebillionpricesproject.com), an umbrella of multiple projects that collect price data for various purposes using various methods.

This case study illustrates how to combine **different data collection methods** and what the challenges are with such data collection. It discusses how products were selected and how prices were measured, and what those methods imply for coverage of observations and reliability of variables. There is no dataset to analyze in this case study. Case study 6A will use the data described here to investigate whether online and offline prices tend to be the same.

## CH01C Management quality: data collection

How different are firms and other organizations in the terms of their management practices? Is the quality of management related to how large the firms are? Is it affected by whether the owners are the company founders or their families? To answer these, and many related, questions, we need data on management quality. Such data was collected by the World Management Survey (WMS; https://worldmanagementsurvey.org/), an international research intitative to measure the differences in management practices across organizations and countries.

This case study illustrates how to collect data by **surveys**. It discusses **sampling** and its practical issues, and how to use a set of survey questions to measure and abstract concept such as the quality of management. This case study, similarly to the other case studies in this chapter, illustrates the choices and trade-offs data collection involves, practical issues that may arise during implementation, and how all that may affect data
quality. There is no dataset to analyze in this case study. Case studies 4A and 21A will use the data described here to investigate how management quality is related to firm size and how it is affected by ownership.

## CH02A Finding a good deal among hotels: data preparation

Continuing with our search for a hotel that is underpriced relative to its location and quality in Vienna, we have scraped data from the web, and we’ve got a data table. But how should we start working with this data? In particular, how should we identify hotels, how should we make sure each hotel features only once in the data, and how should we select the variables we would consider for our future analysis?

This case study uses the `hotels-vienna`

dataset to illustrate how to find problems with observations and variables. It illustrates the various **types of variables**. It shows how to create a **tidy data table** and how to deal with **missing values** and **duplicates**. It allows instructors to demonstrate the importance of **data cleaning** and the common **steps of data wrangling**. We described data collection and quality in case study 1A, and we will use the data in case studies 3A, 7A, 8A, 9B, and 10B to illustrate steps of data analysis that lead to finding good deals.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-vienna.
**Graphs**: .png or .eps

## CH02B Displaying immunization rates across countries

Immunization against measles is an effective way to prevent the disease and may save the lives of children. But how do various countries fare in terms of their immunization rates? In particular, how should we structure and use data from many countries and many years to analyze immunization rates across countries and years?

This short case study illustrates how to store **multi-dimensional data**. It uses the `world-bank-immunization`

dataset with data from the World Development Indicators data website maintained by the World Bank to look at countries’ annual immunization rate and GDP per capita. The case study illustrates the structure of **xt panel data data** with a cross-sectional and time series dimension (country and year), with two corresponding ID variables and two other variables (immunization rate and GDP per capita). It allows instructors to demonstrate xt panel data tables in **long format** and **wide format**. Case study 23B will use the data described here to investigate the effect of immunization on the survival chances of children.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: world-bank-immunization.
**Graphs**: .png or .eps

## Ch02C Identifying successful football managers

The English Premier League (EPL) is the top football (soccer) division in England. Team managers, as coaches are known in football, arguably play a very important role in the success of their teams. How can we use two separate data tables on games and managers to identify the most successful football manager in the EPL?

This case study uses the `football`

dataset that covers all games played in the EPL and data on managers, including which team they worked at and when. We create a data table by joining two different data tables, define the measure of success as average points per game, and identify the most successful managers. This case study illustrates how to prepare data for analysis and illustrates **linking data tables** with different kinds of observations and common problems that can arise while doing so. It is a good example of **entity resolution**, and how to work with **relational data**. Case study 24B will use this data to uncover the effect of replacing managers of underperfoming teams on subsequent team performance.

**Code**: **Stata** or **R** or **Python** or ALL

**Data**: football).
**Graphs**: .png or .eps

## Ch03A Finding a good deal among hotels: data exploration

Further continuing our search for a good deal (a hotel in Vienna that is underpriced for its location and quality), we’ve got a clean data table and identified the variables we want to analyze. How should we start the analysis? In particular, how should we explore the most important variables, why should we do that, and what conclusions can we draw from such exploratory analysis?

This case study uses the `hotels-vienna`

dataset to illustrate how to describe the **distribution of variables** and how to use the findings to identify potential problems in the data, such as **extreme values**. The case study also illustrate how to make **decisions about extreme values**, guided by the ultimate question of the analysis. Along the way, it introduces guidelines for **data visualization** in general, and the design of **histograms** in particular. Case studies 1A and 2A describe data collection and cleaning, and we will use the data in case studies 7A, 8A, 9B, and 10B to illustrate further steps of data analysis that lead to finding good deals.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-vienna.
**Graphs**: .png or .eps

## Ch03B Comparing hotel prices in Europe: Vienna vs London

How can we compare hotel markets over Europe and learn about characteristics of hotel prices? Can we visualize two distributions on one graph? What descriptive statistics would best describe each distribution and their differences? Can we visualize descriptive statistics?

This case study uses the `hotels-europe`

dataset and selects 3-4 star hotels in Vienna and London to compare the distribution of prices for a weekday in November 2017. It illustrates the comparison of distributions and the use of **histograms** and **density plots**. It illustrates the use of some of the most important **descriptive statistics** for quantitative variables and their visualizations, **box plots** and **violin plots**.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-europe.
**Graphs**: .png or .eps

## Ch03C Measuring home team advantage in football

Is there such a thing as home team advantage in professional football (soccer)? That is, do teams that play in their home stadium tend to perform better? And how should we measure better performance?

This case study uses the `football`

dataset, with data on the games played in the English Premier League (EPL) during the 2016/17 season. The case study shows the use of **exploratory data analysis** to answer a substantive question and introduces guidelines to present statistics in a good table.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: football).
**Graphs**: .png or .eps

## Ch03D Distributions of body height and income

Are the distributions of body heigh and family income well approximated by theoretical distributions? Answering these questions can help characterize their distributions and provide guidance for future analysis on how to use these variables.

In this very short case study, we examine survey data collected by the Health and Retirement Study in the U.S.A. in 2014 (`height-income-distributions`

dataset). We show that the height of women aged 55-60 can be described by the **normal distribution**, whereas the income of their households is reasonably well characterized by the **lognormal distribution**.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: height-income-distributions.
**Graphs**: .png or .eps

## Ch04A Management quality and firm size: describing patterns of association

Are larger companies better managed? We want to explore the association between management quality and firm size in a particular country (Mexico). To answer this question we need to define the *y* and *x* variables in this comparison. In particular, we need to assess how the variables in the dataset correspond to the abstract concepts of management quality and firm size.

This case study uses the Mexican subsample of the World Management Survey dataset (`wms-management-survey`

) from 2013. It illustrates how we can measure **latent variables** by **proxy variables** in the data and uncover patterns of association betewen those variables. It also illustrates the concepts of **conditional probability**, **conditional distribution**, and **joint distribution**. The case study introduces informative ways to visualize various aspects of patterns of association, such as the **stacked bar chart**, the **scatterplot**, the **bin scatter**, and comparing **box plots** and **violin plots**. We have introduced the data used here in case study 1C.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: wms-management-survey.
**Graphs**: .png or .eps

## CH05A What likelihood of loss to expect on a stock portfolio?

Can we find out the future likelihood of a large loss on a stock portfolio based on data from the past? We choose the S&P 500 stock market index as our investment portfolio, and we defining a large loss as an at least 5% drop in returns from one day to another. We can easily calculate the proportion of such days in the data, but we are interested in future losses not past ones. To answer our question we need to make generalizations from our data. Such generalizations are bound to bring uncertainty, and we would like to quantify that uncertainty, too.

This case study uses the `sp500`

dataset that covers day-to-day returns on the S&P 500 stock market index for 11 years to illustrate how we can generalize an estimated statistic from a particular dataset to the **population**, or **general pattern**, it represents, and beyond, to the general pattern we are interested in. The case study illustrates the concept of **repeated samples**. It shows how to estimate the **standard error** by **bootstrap** or using a formula, and how to construct and interpret a **confidence interval**. It also illustrates how to think about **external validity**. Case study 6B will use the same data to answer a related, but slightly different question.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: sp500.
**Graphs**: .png or .eps

## CH06A Comparing online and offline prices: testing the difference

Do online and offline prices of the same products tend to be the same? Answering this question can help make better purchase choices, understand the business practices of retailers, and it can inform whether we can use online data in approximating offline prices for policy analysis.

This case study uses the `billion-prices`

dataset. We examine online and offline prices of retail products in the U.S. in 2015-16. The case study illustrates how to translate a more abstract question into an inquiry about a **statistic** (here the average difference). It shows how to formulate a **null hypothesis** and an **alternative hypothesis** and how to carry out a **hypothesis test** in two ways, by calculating the **t-statistic** and comparing it to an appropriate **critival value**, or, alternatively, by using the **p-value**. The case study also illustrates the perils of testing **multiple hypotheses** and **p-hacking**. We have introduced the data used here in case study 1B.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: billion-prices.
**Graphs**: .png or .eps {:target=”_blank”}

## CH06B Testing the likelihood of loss on a stock portfolio

Will our investment portfolio suffer a large loss with a higher chance than what we can accept? When we want to know what’s the likelihood of large future losses on our portfolio, we can use the confidence interval to quantify the uncertainty from estimating it from data on past returns. But we can ask a more pointed question, too: whether our stock portfolio is will suffer large future losses more often than we can accept. To answer that question we need a different procedure: testing a hypothesis.

This case study uses the `sp500`

dataset that covers day-to-day returns for 11 years to illustrate how we can test whether a likelihood is greater or less than a specified value. It illustrates **testing proportions** and how to formulate and carry out a **one-sided hypothesis test**. The case study is a continuation of case study 5A, using the same data.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: sp500.
**Graphs**: .png or .eps

# PART II: REGRESSION ANALYSIS

## CH07A Finding a good deal among hotels with simple regression

How can we find the hotels that are underpriced relative to their distance from the city center? Continuing the previous case studies that resulted in a clean data table ready for analysis and explored the main variables, we need to uncover how hotel price is related to distance to the city center to know what price to expect at what distances. Then can we identify hotels that are the most underpriced compared to their expected price.

This case study uses the `hotels-vienna`

dataset to illustrate regression analysis with one right-hand-side variable. It shows the use of **bin scatters** and **lowess non-parametric regressions** that reveal qualitative patterns of association. In order to find out the quantitative relationship between distance and average price, we apply **simple linear regression**. The case study illustrates the use of **predicted values** and **regression residuals**

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-vienna.
**Graphs**: .png or .eps

## CH08A Finding a good deal among hotels with non-linear function

Continuing our search for the best hotel deals in Vienna, we would like to uncover the shape of the price-distance association to get at the best estimates of expected prices at various distances. But what’s the best way to compare prices? Should we compare their absoulte values, or should we aim for a relative comparison, such as percent differences? And how can we do the latter in a regression using cross-sectional data?

This short case study again uses the `hotels-vienna`

dataset, to illustrate linear regression analysis with the use of **logarithms**. It shows whether and why it may make sense to take logs of the variables in the regression, and how to estimate, and interpret the results of, and choose from **level-log** regressions, **log-level** regressions, and **log-log** regressions.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-vienna.
**Graphs**: .png or .eps

## CH08B How is life expectancy related to the average income of a country?

People tend to live longer in richer countries. How long people live is usually measured by life expectancy; how rich a country is usually captured by its yearly income, measured by GDP. But should we use total GDP or GDP per capita? And what’s the shape of the patterns of association? Is the same percent difference in income related to the same difference in how long people live among richer countries and poorer countries? Finding the shape of the association helps benchmarking life expectancy among countries with similar levels of income and identify countries where people tend to live especially long or especially short lives for their income.

This case study uses the `worldbank-lifeexpectancy`

dataset based on the World Development Index database available at the World Bank webside. It examines cross-sectional data from a single year, 2017, for 182 countries. The case study illustrates the choice between **total and per capita measures** (here GDP), regressions with **variables in logs**, and two ways to model nonlinear patterns in the framework of the linear regression: **piecewise linear splines**, and **polynomials**. It also illustrates whether and how to use **weights** in regression analysis, and what that choice implies for the correct interpretation of the results. The case study also shows how to use **informative visualization** to present the results of regressions.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: worldbank-lifeexpectancy.
**Graphs**: .png or .eps

## CH08C Measurement error in hotel ratings

When we search for a good deal among hotels, we care about hotel quality as well as distance to the city center. Online price comparison websites collect customer ratings and publish the average of those ratings, which can serve as a measure of quality. But some averages are based on very few ratings while others are based on hundreds or thousands of ratings. Should we be concerned about ratings coming from very few customers? In particular, what are the consequences of that feature of the data on the results of regression analysis?

This short case study again uses the `hotels-vienna`

dataset, to illustrate the consequences of measurement error for regression analysis. In particular, it shows the effect of **classical measurement error** in the right-hand-side variable on the estimated slope of a simple linear regression.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-vienna.
**Graphs**: .png or .eps

## CH09A Estimating gender and age differences in earnings

Do women working in the same occupation tend to earn the same as men? And what are the differences in earnings by age? Understanding these differences may help students know what to expect when choosing a particular career.

This case study uses the `cps-morg`

dataset, a cross-section based on the Current Population Survey (CPS) of the U.S. in 2014. It focuses on a single occupation potentially relevant for many students of data analysis, “Market research analysts and marketing specialists”. The case study illustrates how to estimate the **standard error** of regression coefficients and how to construct and interpret **confidence intervals**. It also shows how to **test hypotheses about regression coefficients** and the standard way of **presenting regression results** in tables. We will ues a larger subsample of the same data in case study 10A to uderstand the sources of gender difference in earnings.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: cps-morg.
**Graphs**: .png or .eps

## CH09B How stable is the hotel price–distance to center relationship?

We have uncovered the average price - distance association among hotels in a particular city on a particular date. How generalizable is this pattern to other dates, to other cities, and to other types of accommodations?

This case study uses the `hotels-europe`

data from Vienna, Amsterdam and Barcelona. It illustrates the various kinds of **issues with external validity**, first focusing on time (different dates), then space (different cities), and groups of observations (different kinds of accommodations).

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-europe.
**Graphs**: .png or .eps

## CH10A Understanding the gender difference in earnings

Women earn less, on average, than man with similar qualifications. How large is that difference among employees with a graduate degree? How does that difference vary with age? And how much do characteristics of the employers and family circumstances of the employees explain of the difference? Understanding the magnitude, patterns, and causes of gender differences in earnings is important from the viewpoint of social equity as well as efficient allocation of labor.

This short case study uses the `cps-morg`

dataset to illustrate the use of multiple regression analysis to help understand the sources of differences between groups of observations. The data is a cross-section based on the Current Population Survey (CPS) of the U.S. in 2014, and the sample is restricted to employees with a graduate degree. The case study illustrates how to estimate and intepret the results of a **multiple regression**. It shows how to include **qualitative right-hand-side variables** and **interactions** in the regression, how to interpret their results, and how to use **visualization** to present estimtes of nonlinear patterns. The case study illustrates the difficulty of uncovering **causal relationships** from the results of multiple regression analysis using cross-sectional observational data.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: cps-morg.
**Graphs**: .png or .eps

## CH10B Finding a good deal among hotels with multiple regression

We return to estimating a good deal among hotels for the last time. We want to find the hotels that are underpriced for their quality and distance to the city center. To do so we first need to uncover expected prices at various levels of distance and quality in a way that reflects all important patterns in the data. Then can we look for hotels that are the most underpriced relative to their expected price.

This case study uses `hotels-vienna`

dataset to illustrate the use of multiple regression analyis for prediction within a sample and **residual analysis**. It uses the susample of 3-4 star hotels for a single night in Vienna in November 2017. It illustrates the use of a **nonlinear specification** within a multiple regression and how to identify observations with the largest negative **residuals**. It also illustrates the use of the ** y-hat - y plot** to visualize the prediction within the sample and the residuals from the predicted values.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: hotels-vienna.
**Graphs**: .png or .eps

## CH11A Does smoking pose a health risk?

Are smokers less likely to remain healthy than non-smokers? How about former smokers who quit?

This case study uses the `share-health`

data from the SHARE survey (Survey for Health, Aging and Retirement in Europe). We focus on people who were 50 to 60 years old and said to be in good health in 2011. We look at how they rated their health in 2015 and see who remained healthy ahd who changed their answer to not healthy. This case study illustrates probability models. It shows how to estimate and interpret the results of a **linear probability model** and the uses of **logit** and **probit** models. It compares the linear probability estimates to the estimated **marginal differences** from logit and probit. Finally, it illustrates when and how the different models may result in different **predicted probabilities** and how to compare their fit using **Brier-score** and other measures of fit.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: share-health.
**Graphs**: .png or .eps

## CH11B Are Australian weather forecasts well-calibrated?

Should we take an umbrella when weather forecast predicts rain? In particular, how should we trust the weather forecast when it predicts a certain the likelihood of rain? For example, is it true that it rains on 20 percent of the days when it says the likelihood is 20 percent?

This short case study uses the `australia-weather-forecast`

data covering 350 days in 2015/16 and looks at rain forecast and actual rain for the Northern Australian city of Darwin. The case study illustrates how to construct and interpret a **calibration curve**.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: australia-weather-forecast.
**Graphs**: .png or .eps

## CH12A Returns on a company stock and market returns

How do monthly returns on a company stock move together with monthly market returns? The strength of this association is a good measure of how risky the company stock is.

This case study uses the `stocks-sp500`

dataset covering 21 years of daily data of many company stocks, focusing on the Microsoft stock and the S&P 500 stock market index. We construct monthly time series of percent returns as the percent change in closing price on the last day of each month. The case study illustrates the use of a simple **time series regression** in changes, focusing on the interpretation and visualization of the results.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: stocks-sp500.
**Graphs**: .png or .eps

## CH12B Electricity consumption and temperature

How does temperature affect residential electricity consumption? Answering this question can help planning for electricity production and assess the potential effects of climate on electricity use.

This case study uses the `arizona-electricity`

dataset that that covers 17 years of monthly electricity consumption data from the state of Arizona in the USA and monthly temperature data from a weather station in its largest city, Phoenix. Using transformed variables of average “cooling degrees” and average “heating degrees” per month, we estimate time series regressions in changes and with and without season dummies. This case study illustrates how to estimate and intepret the results of **times series regressions** specified in **changes**. It shows how to handle and interpret **seasonality** and **lagged associations**, and how to use **Newey-West standard errors** or include **lagged dependent variables** to estimate standard errors that are tobust to **serial correlation** in time series regressions.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: arizona-electricity.
**Graphs**: .png or .eps

# PART III: PREDICTION

## CH13A Predicting used car value with linear regressions

For how much can we expect to sell our used car? And what could price we expect if we waited a year or more? With appropriate data on similar used cars we can estimate various regression models to predict expected price as a function of its features. But how should we select the best regression model for prediction?

This case study uses the `used-cars`

dataset with data from classified ads of used cars from various cities of the U.S.A. in 2018. We select a single model and a single city. The variables include the ask price and various features (age, odometer, cylinders, condition, etc.). We specify several linear regression models to predict the expected price as a function of car features. This case study illustrates the basic logic of carrying out **predictive data analysis** and **model selection**, emphasizing the need to achieve a good fit in the **live data** by selecting a model using the **original data** and avoiding both **underfitting** and **overfitting** the data. It illustrates the use of a **loss function** such as **mean squared error (MSE)** as a measure of fit, and it discusses alternative model selection strategies such as the **BIC**, the **training-test split**, and its improved version, **k-fold cross-validation**.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: used-cars.
**Graphs**: .png or .eps

## CH14A Predicting used car value: log prices

Continuing with our example of predicting used car prices, how should we decide on whether to transform our target variable? In particular, we can speficy regression models with log price instead of price as the target variable. How to make predictions about price when the target variable is in logs, and how to choose between models with log price versus price as the target variable?

This short case study uses the same `used-cars`

dataset as case study 13A with used car data from several cities in the USA in 2018. The case study illustrates prediction with a **target variable in logs**. In particular, it shows how to apply **log correction** to predict a *y* variable when the model is specified in *ln(y)* and how to construct appropriate **prediction intervals**. The case study is a continuation of case study 13A, using the same data, and case study 15A uses the same data, too, to illustrate an alternative predictive model.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: used-cars.
**Graphs**: .png or .eps

## CH14B Predicting AirBnB apartment prices: selecting a regression model

London, UK is a popular tourist destination for business and leisure. We want to predict the rental price of an apartment offered by AirBnB in Hackney, a London borough. The results of this prediction can help tourists choose an offer that is underpriced for its features or apartment owners to deciding on what price they could expect if they rented out their apartment on AirBnB.

This case study uses the `airbnb`

dataset that includes rental prices for one night in March 2017 in greater London, and selects a specific borough. After sample design, we specify linear regressions of varing complexity and a model with LASSO. The case study illustrates the various methods of **building regression models**, including **LASSO**, and the use of a **holdout sample** for **evaluating** the prediction using the best model.

**Code**: **Stata-prep**, **Stata-study** or **R-prep**, **R-study** or **Python** or ALL.
**Data**: airbnb.
**Graphs**: .png or .eps

## CH15A Predicting used car value with regression trees

Further continuing with our example of predicting used car prices, is there a better method for prediction than regression? Ideally, such a method would be better than linear regression at capturing the most important nonlinear patterns and interactions between feature variables and arrive at better predictions. The regression tree promises to be such an alternative, but how does it compare to linear regression in an actual prediction?

This case study uses the `used-cars`

dataset from 2018 and its combined Chcicago and Los Angeles subsamples on a specific model, to illustrate regression trees. We grow several regression trees and compare their predictive performance with the performance of linear regressions. This case study illustrates how we can grow a **regression tree** with the help of the **CART algorithm**, why we can think of a **regression tree as a nonparametric regression**, and how such a regression tree could **overfit** the original data even with **stopping rules** or **pruning**. The case study is a continuation of case studies 13A and 14a, using the same data source but a larger subsample of the observations.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: used-cars.
**Graphs**: .png or .eps

## CH16A Predicting apartment prices with random forest

Continuing with our question of how to predict AirBnB apartment prices in London, UK, we want to build the best model for prediction. In particular, we want to see how two different methods that combine many regression trees compare to each other, to the single regression tree, and to linear regressions.

We use the `airbnb`

dataset that includes rental prices for one night in March 2017 from the area of Greater London. Using apartment location and various features of accommodation as predictors, we carry out feature engineering and build random forest models and gradient boosting machine method (GBM) models, both ((ensemble methods** that use **many regression trees**. This case study illustrates prediction with **random forest** and **boosting** and the evaluation of such predictions. It shows how to carry out necessary **feature engineering**, how to set various **tuning parameters** for the different methods and how those affect the predictions. It also illustrates the use of **variance importance plots** and **partial dependence plots** to help understand the patterns of association that drive the predicitons in these **black box models**. The case study is a continuation of case study 14B, using the same data source but the entire London sample instead of a single borough.

**Code**: **Stata** or **R-prep**, **R-study** or **Python** or ALL.
**Data**: airbnb.
**Graphs**: .png or .eps

## CH17A Predicting firm exit: probability and classification

Many companies have relationships with other companies, as suppliers or clients. Whether those other companies stay in business in the future or exit is an important question for them. How can we use data on many companies across the years to predict the probability of their exit? And can we classify them into two groups, companies that are likely to exit and companies that are likely to stay in business?

This case study uses the `bisnode-firms`

dataset, a panel dataset with a large number of companies from specific industries in a European country, to illustrate probability prediction and classification. After a good deal of feature engineering we estimate several logit models to predict the probablity of firm exit and compare their performance by 5-fold cross-validation, choose the best model to describe how well it predicts the probabilities on a holdout sample, and use the predicted probabilities and two alternative methods for classification. This case study illustrates how to carry out **probability predictions**, how to evaluate their goodness of fit and other aspects of predictive performance, how to find an **optimal classification threshold** with the help of a loss function usign a formula or model-dependent cross-validation, and how to use **expected loss** and the **confusion table** to evaluate classifications. It illustrates how the **ROC curve** visualizes the trade-offs of false positive and negative decisions at various classification thresholds, and how to use **random forest for probaility prediction and classification**. The case study is also a good example of potential issues with **external validity** of predictions and how we may detect the possibility of such issues in the original data.

**Code**: **Stata** or **R-prep**, **R-study** or **Python** or ALL.
**Data**: bisnode-firms.
**Graphs**: .png or .eps

## CH18A Forecasting daily ticket sales for a swimming pool

How can we use transaction data to predict the daily volume of sales? In particular, how can we use data on sales terminal data on tickets sold to a swimming pool to predict the number of tickets sold on each day next year?

This case study uses the `swim-transactions`

dataset with transaction-level data from all swimimng pools for many years in Albuquerque, New Mexico, USA, and selects a single swimming pool. The case study illustrates long-term forecasts. We aggregate the data to daily frequency, discuss data issues and how to solve them, specify several regression models, and select the best by cross-validation. The case study illustrates the use of **transaction data** in predictive analytics, **cross-validation with time series data**, the use of **trend** and, especially, **seasonality** in making long-term predictions and the use of the autmated **Prophet** algorithm. It is an example of how evaluating predictions can detect problems that further data work and analysis may solve.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: swim-transactions.
**Graphs**: .png or .eps

## CH18B Forecasting a house price index

How can we use data on past home prices, and possibly other variables, to predict how home prices will change in a particular city in the next months?

This case study uses the `case-shiller-la`

dataset with monthly observations on the Case-Shiller home price index for the city of Los Angeles, California, USA between 2000 and 2017. The dataset also contains monthly time series of the unemployment rate and employment rate. After exploratory data analysis we estimate various ARIMA time series models that use the price index, as well as VAR models that use the unemployment and employment rates as well, and we use appropriate cross-validation to select the best model. The case study illustrates how to make use of **serial correlation** to make short-term forecasts with the help of **ARIMA models**, how to use other variables and their forecasted values in a **vector autoregression (VAR)** model, and how to select the best model by **cross-validation with time series data** that preserves the serial correlation in the data.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: case-shiller-la.
**Graphs**: .png or .eps

# PART IV: CAUSAL ANALYSIS

## CH19A Food and health

Does eating a lot of fruit and vegetables helps remain healthy? Can we use available data on people’s eating habits and health to uncover those effects? What are the most important problems with using such data to answer our question, and can we do anything about them?

This case study uses the `food-health`

dataset, cross-sectional data collected on the health and eating habits of people as part of the National Health and Nutrition Examination Survey (NHANES, USA); we use data from years 2009-2013. We focus on the subsample of people aged 30-59 years old. The case study illustrates how to define an effect using the **potential outcomes framework**, how to use **causal maps** to visualize our assumptions about the causal relationships between variables, how to translate **latent variables** into their measured **proxy variables** that can be used in actual analysis, how to think about the **sources of variation** in the causal variable, and what variables we should **condition on** in an analysis that attempts to uncover the effect. The case study also illustrates the difficulty of uncovering effects from cross-sectional observational data.

**Code**: **Stata-prep**, **Stata-study** or **R-prep**, **R-study** or **Python-prep**, **Python-study** or ALL.
**Data**: food-health.
**Graphs**: .png or .eps

## CH20A Working from home and employee performance

What is the effect of working from home on employee performance? How can we design an experiment that could measure this effect? Once the data is collected from the experiment, how should we assess its quality, estimate the effect, and evaluate the internal and external validity of the results?

This case study uses the `working-from-home`

data, from an experiment that was carried out at a large travel agency in China. The case study illustrates how to design a **field experiment**, what are potential **issues with internal validity** and how to address them in the design or the analysis of the experiment, and how to analyze experimental data. It shows how to check **covariate balance** and how to interpret its results, how to assess **compliance**, and how to use regression analysis to estimate the effects of the experiment. The case study also illustrates how the results of the experiment can be **used in business decisions**, and what issues may arise with the **external validity** of the results.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: working-from-home.
**Graphs**: .png or .eps

## CH20B Fine tuning social media advertising

There are many choices to make when designing an online advertisement, inlcuding text content and details of appearance. Having alternative versions of these details, how can we select the version that would yield the most return?

This case study describes an A/B testing that we carried out on a social media platform. We tested two versions of a text advertising a data analysis program and measured the number of clicks on the ad and the number of actions (leaving one’s email address). The case study illustrates the steps of **designing an A/B test** in general, and **power calculation** or **sample size calculation** in particular. There is no dataset for this case study.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: ab-test-social-media.

## CH21A Founder/family ownership and quality of management

Many firms are owned by their founder or family members of their founder. Are such founder/family owned firms as well managed as other kinds of firms and, if there is a difference, how much of that that is due to their ownership as opposed to something else? Can we uncover that effect using cross-sectional observational data on firms and their management practices?

This case study uses the `wms-survey-management`

dataset that we introduced in case study 1C. It is a large multi-country multi-sector survey of companies, measuring their management practices and other company characteristics. We use the cross-sectional sample collected from 24 countries between 2004 and 2015. The case study illustrates the use of **thought experiments** to clarify what effect we want to measure, how to think about **what variables to condition on**, and how we may **sign the omitted variables bias**. Besides **multiple regression**, it illustrates **exact matching** and **matching on the propensity score**, discussing their feasibility, advantages and disadvantages, and comparing their results. The case study is another example illustrating the difficulty to uncover an effect using cross-sectional observational data.

**Code**: **Stata** or **R-prep**, **R-study** or **Python-prep**, **Python-study** or ALL.
**Data**: wms-survey-management.
**Graphs**: .png or .eps

## CH22A How does a merger between airlines affect prices?

When two companies merge, the new firm has more market power, and it may use that power to increase price or decrease quality. How can we measure the effect of a merger between two firms on the price they charge? How can we use panel data from many markets to uncover this effect?

This case study uses the `US-airlines`

dataset that is based on 10 percent of all tickets sold on the U.S. market, collected and maintained by the U.S. Department of Transportation. We use this data to evaluate the efect of the merger of American Airlines and US Airways. We define markets and aggregate the data to market-year level and compare price changes across markets with and without the two airlines before the merger. The case study illustrates the use of **transaction data** to carry out a market-level analysis, the difficulties of **defining markets**, and using **difference-in-differences** analysis to estimate an effect. It shows how to examine **pre-intervention trends** to assess the **parallel trends assumption**, and how to estimate generalized versions of difference-in-differences analysis adding covariates or using a **quantitative treatment variable**.

**Code**: **Stata** or **R-prep**, **R-study** or **Python-prep**, **Python-study** or ALL.
**Data**: US-airlines.
**Graphs**: .png or .eps

## CH23A Import demand and industrial production

How does import demand of a large country affect the industrial production of a medium-sized open economy? With time series data on imports of the large receiving country and indistrual production of the smaller country, we can estimate a time series regression to uncover the effect. But the the typical time series we can use are not very long, leading to uncertain estimates with wide confidence intervals. How can we use comparable data from other, similar countries to get more precise estimates?

This case study uses the `asia-industry`

dataset with monthly time series of imports to the USA and industrial production in several Asian countries. The case study illustrates the use of time series regression to uncover an effect, including **contemporaneous effects**, **lagged effects** and their sum, **cumulative effects**. It then shows how we can use **pooled time series**, time series of the same varables from similar subjects (here countries), to arrive at more precise estimates of the same effect.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: asia-industry.
**Graphs**: .png or .eps

## CH23B Immunization against measles and saving children

Immunization against measles is an effective way to prevent the disease and may save the lives of children. How can we use data from many countries and several years with immunization and child mortality rates to uncover the effect of immunization on the survival chances of children?

This case study uses the `world-bank-immunization`

dataset with data from the World Development Indicators data website maintained by the World Bank to look at countries’ annual immunization rate and GDP per capita. The case study illustrates **panel data regressions** with **fixed-effects (FE)** and estimated in **first differences (FD)**. It shows how the inclusion of **time dummies** can condition on **aggregate trends** of any form, the need to estimate **clustered standard errors** that are robust to heteroskedasticity as well as serial correlation. It shows that the inclusion of **lagged right-hand-side variables** can help capture lagged effects and, in the case of FD models, estimate **cumilative effects**, and the inclusion of **lead terms** of the right-hand-side variables can capture **pre-intervention trends**. It also shows how including unit-specific cosntants in an FD model can help capture **time trends specific for cross-sectional units**. The case study compares the results of FE and FD regressions and discusses their differences.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: world-bank-immunization.
**Graphs**: .png or .eps

## CH24 Estimating the effect of the 2010 Haiti earthquake on GDP

In January 2010, a strong earthquake hit the Caribbean island country Haiti, with an epicenter very close to the country’s capital. What was the effect of the earthquake on the Haitian economy in the short and the longer run? We can easily measure how total GDP changed in the year of the earthquake and how it evolved in the following years. However, to estimate the effect of the earthquake we need to estimate the counterfactual: how total GDP would have changed if Haiti hadn’t experienced an earthquake. How van we estimate such a counterfactual?

This case study uses the `haiti-earthquake`

dataset with yearly observations of several macro variables for many countries. The case study illustrates **comparative case studies** and how to construct a **synthetic control** observation (here country) from data from other countries to estimate the counterfactual. It shows how to select **donor pool** of observatons similar to the case study observation (Haiti), how to select the variables whose **pre-intervention values** we want to be similar between the case study observation and the synthetic control observation, and how to use the **algorightm** of the synthetic control method to assign **weights** to each observation in the donor pool to construct the synthetic control observation. The case study also illustrates the **visualization** of the results of synthetic control analysis and the potential issues with the method to uncover the counterfactual.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: haiti-earthquake.
**Graphs**: .png or .eps

## CH24 Estimating the impact of replacing football team managers

Success in team sports depends on many things, and the work of the coach, or manager, is likely one of them. When a team performs below expectations, replacing the manager is one of the options teams can consider. How can we use data on all games for several seasons from a professional football (soccer) league and their managers to show how team performance tends to change after a manager is replaced? And how can we use the same data to estimate the counterfactual: how how the performance of low-performing teams would have changed if the manager hadn’t been replaced?

This case study uses the `football`

dataset with all games of the English Premier League (EPL) in 11 seasons and who the team manager was at each game. It illustrates the **event study** method to estimate contemporaneous and lagged effects with **xt panel data**. It shows how we can **select a control group** from all observations that is similar, on average, in pre-intervention variables (here team performance) to estimate the counterfactual post-intervention outcomes, and how to define and select **pseudo-interventions** that are necessary to define the control group. We used the same dataset in case study 2B.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: football).
**Graphs**: .png or .eps