# Part III: PREDICTION

## CH13A Predicting used car value with linear regressions

For how much can we expect to sell our used car? And what price could we expect if we waited a year or more? With appropriate data on similar used cars, we can estimate various regression models to predict the expected price as a function of a car's features. But how should we select the best regression model for prediction?

This case study uses the `used-cars` dataset with data from classified ads of used cars from various cities of the USA in 2018. We select a single car model and a single city. The variables include the ask price and various features (age, odometer, cylinders, condition, etc.). We specify several linear regression models to predict the expected price as a function of car features. This case study illustrates the basic logic of carrying out **predictive data analysis** and **model selection**, emphasizing the need to achieve a good fit in the **live data** by selecting a model using the **original data** and avoiding both **underfitting** and **overfitting** the data. It illustrates the use of a **loss function** such as **mean squared error (MSE)** as a measure of fit, and it discusses alternative model selection strategies such as the **BIC**, the **training-test split**, and its improved version, **k-fold cross-validation**.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: used-cars.
**Graphs**: .png or .eps
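
The model selection logic described above can be sketched in a few lines. This is a minimal illustration, not the case study's code: the data are synthetic, and the helper names (`fit_mean`, `fit_line`, `kfold_mse`) are made up for this example. It compares an intercept-only model with a one-variable linear regression by their average MSE over 5 folds.

```python
import random

def fit_mean(train):
    """Intercept-only model: predict the training mean for every car."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Simple OLS of y on a single feature x."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def mse(model, data):
    """Mean squared error of a fitted model on (x, y) pairs."""
    return sum((y - model(x)) ** 2 for x, y in data) / len(data)

def kfold_mse(fit, data, k=5, seed=0):
    """Average test-fold MSE over k folds of shuffled data."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(mse(fit(train), test))
    return sum(scores) / k

# Synthetic (age in years, price in USD) pairs, made up for illustration only.
rng = random.Random(42)
cars = [(age, 20000 - 1200 * age + rng.gauss(0, 1500))
        for age in range(1, 16) for _ in range(4)]

cv_mean = kfold_mse(fit_mean, cars)
cv_line = kfold_mse(fit_line, cars)
print(f"CV MSE, intercept only: {cv_mean:,.0f}")
print(f"CV MSE, linear in age:  {cv_line:,.0f}")
```

The model with the lower cross-validated MSE would be preferred. Because each fold's MSE is computed on observations the model did not see during fitting, the comparison penalizes both underfitting and overfitting.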

## CH14A Predicting used car value: log prices

Continuing with our example of predicting used car prices, how should we decide whether to transform our target variable? In particular, we can specify regression models with log price instead of price as the target variable. How do we make predictions about price when the target variable is in logs, and how do we choose between models with log price versus price as the target variable?

This short case study uses the same `used-cars` dataset as case study 13A with used car data from several cities in the USA in 2018. The case study illustrates prediction with a **target variable in logs**. In particular, it shows how to apply **log correction** to predict a *y* variable when the model is specified in *ln(y)* and how to construct appropriate **prediction intervals**. The case study is a continuation of case study 13A, using the same data, and case study 15A uses the same data, too, to illustrate an alternative predictive model.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: used-cars.
**Graphs**: .png or .eps
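
As a rough sketch of the log correction mentioned above: when the residuals of the log model are approximately normal, a common correction multiplies the exponentiated log prediction by `exp(sigma^2 / 2)`, where `sigma` is the residual standard deviation in log units. The function name and the numbers below are hypothetical and chosen only for illustration.

```python
import math

def predict_level_from_log(log_pred, resid_sd):
    """Turn a predicted ln(y) into a predicted y, assuming normal log residuals.

    Returns both the naive prediction exp(log_pred) and the corrected one.
    """
    naive = math.exp(log_pred)
    corrected = naive * math.exp(resid_sd ** 2 / 2)
    return naive, corrected

# Hypothetical numbers: predicted ln(price) = 8.5, residual SD = 0.4 log units.
naive, corrected = predict_level_from_log(8.5, 0.4)
print(f"naive: {naive:.0f}, corrected: {corrected:.0f}")
```

The corrected prediction is always at least as large as the naive one, and the gap grows with the residual variance of the log model.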

## CH14B Predicting AirBnB apartment prices: selecting a regression model

London, UK is a popular tourist destination for business and leisure. We want to predict the rental price of an apartment offered on AirBnB in Hackney, a London borough. The results of this prediction can help tourists choose an offer that is underpriced for its features, or help apartment owners decide what price they could expect if they rented out their apartment on AirBnB.

This case study uses the `airbnb` dataset that includes rental prices for one night in March 2017 in Greater London, and selects a specific borough. After sample design, we specify linear regressions of varying complexity and a model with LASSO. The case study illustrates the various methods of **building regression models**, including **LASSO**, and the use of a **holdout sample** for **evaluating** the prediction using the best model.

**Code**: **Stata-prep**, **Stata-study** or **R-prep**, **R-study** or **Python** or ALL.
**Data**: airbnb.
**Graphs**: .png or .eps
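
To give a feel for how LASSO selects variables, here is a minimal coordinate descent sketch. It is an illustration under simplifying assumptions, not the case study's implementation: the target and all features are assumed to be centered (so there is no intercept), the tiny dataset is made up, and the function names are hypothetical. In practice one would use an established library and choose the penalty by cross-validation.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_passes=100):
    """Minimize (1/2n) * sum of squared residuals + penalty * sum(|b_j|).

    Assumes y and every column of X are already centered (no intercept).
    """
    n, k = len(X), len(X[0])
    coefs = [0.0] * k
    for _ in range(n_passes):
        for j in range(k):
            # Average product of feature j with the residual that ignores feature j.
            rho = 0.0
            for row, target in zip(X, y):
                fitted_without_j = sum(row[m] * coefs[m] for m in range(k) if m != j)
                rho += row[j] * (target - fitted_without_j)
            rho /= n
            scale = sum(row[j] ** 2 for row in X) / n
            coefs[j] = soft_threshold(rho, penalty) / scale
    return coefs

# Made-up, centered data: the first feature matters a lot, the second barely.
X = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
y = [3 * x1 + 0.1 * x2 for x1, x2 in X]

ols_like = lasso_coordinate_descent(X, y, penalty=0.0)
shrunk = lasso_coordinate_descent(X, y, penalty=0.5)
print("no penalty:  ", ols_like)
print("penalty 0.5: ", shrunk)
```

With the penalty switched on, the coefficient of the weak feature is set exactly to zero and the strong one is shrunk, which is how LASSO performs variable selection and shrinkage at the same time.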

## CH15A Predicting used car value with regression trees

Further continuing with our example of predicting used car prices, is there a better method for prediction than regression? Ideally, such a method would be better than linear regression at capturing the most important nonlinear patterns and interactions between feature variables and arrive at better predictions. The regression tree promises to be such an alternative, but how does it compare to linear regression in an actual prediction?

This case study uses the `used-cars` dataset from 2018 and its combined Chicago and Los Angeles subsamples for a specific car model, to illustrate regression trees. We grow several regression trees and compare their predictive performance with the performance of linear regressions. This case study illustrates how we can grow a **regression tree** with the help of the **CART algorithm**, why we can think of a **regression tree as a nonparametric regression**, and how such a regression tree could **overfit** the original data even with **stopping rules** or **pruning**. The case study is a continuation of case studies 13A and 14A, using the same data source but a larger subsample of the observations.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: used-cars.
**Graphs**: .png or .eps
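
The core step of growing a regression tree is finding the binary split that minimizes the sum of squared errors around the two group means. The sketch below shows that step for a single feature. It is a simplified illustration with made-up data and hypothetical function names; a full CART implementation would repeat this search over all features, apply it recursively, and add stopping rules or pruning.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total SSE) of the best binary split on one feature.

    Returns (None, SSE without a split) if no split improves the fit.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Made-up ages (years) and prices (USD), for illustration only.
ages = [1, 2, 3, 4, 10, 11, 12, 13]
prices = [18000, 17500, 17000, 16500, 6000, 5500, 5000, 4500]

threshold, total_sse = best_split(ages, prices)
print(f"best split: age < {threshold}, SSE after split: {total_sse:,.0f}")
```

Each leaf then predicts the mean price of its observations, which is why a regression tree can be viewed as a step-function approximation of the relationship rather than a fitted line.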

## CH16A Predicting apartment prices with random forest

Continuing with our question of how to predict AirBnB apartment prices in London, UK, we want to build the best model for prediction. In particular, we want to see how two different methods that combine many regression trees compare to each other, to the single regression tree, and to linear regressions.

We use the `airbnb` dataset that includes rental prices for one night in March 2017 from the area of Greater London. Using apartment location and various features of accommodation as predictors, we carry out feature engineering and build random forest models and gradient boosting machine (GBM) models, both **ensemble methods** that use **many regression trees**. This case study illustrates prediction with **random forest** and **boosting** and the evaluation of such predictions. It shows how to carry out necessary **feature engineering**, how to set various **tuning parameters** for the different methods, and how those affect the predictions. It also illustrates the use of **variable importance plots** and **partial dependence plots** to help understand the patterns of association that drive the predictions in these **black box models**. The case study is a continuation of case study 14B, using the same data source but the entire London sample instead of a single borough.

**Code**: **Stata** or **R-prep**, **R-study** or **Python** or ALL.
**Data**: airbnb.
**Graphs**: .png or .eps
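
The idea of combining many regression trees can be sketched with plain bagging: fit a small tree on each bootstrap sample and average the predictions. Note that this is bagging of one-split trees (stumps) on a single feature, not a full random forest, which would grow deeper trees and also sample a random subset of features at each split. The data and function names are made up for illustration.

```python
import random

def fit_stump(rows):
    """Fit a one-split regression tree (a stump) on (x, y) rows."""
    rows = sorted(rows)
    ys = [y for _, y in rows]
    overall = sum(ys) / len(ys)
    best = None
    for i in range(1, len(rows)):
        if rows[i][0] == rows[i - 1][0]:
            continue  # cannot split between identical feature values
        left, right = ys[:i], ys[i:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        total = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or total < best[0]:
            cut = (rows[i][0] + rows[i - 1][0]) / 2
            best = (total, cut, left_mean, right_mean)
    if best is None:  # all x values identical in this bootstrap sample
        return lambda x: overall
    _, cut, left_mean, right_mean = best
    return lambda x: left_mean if x < cut else right_mean

def bagged_stumps(rows, n_trees=50, seed=0):
    """Average the predictions of stumps fitted on bootstrap samples."""
    rng = random.Random(seed)
    trees = [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

# Made-up (distance from center in km, nightly price) pairs, for illustration only.
rows = [(x, 100 - 5 * x) for x in range(1, 11)]
model = bagged_stumps(rows)
for x in (1, 5, 10):
    print(f"distance {x}: predicted price {model(x):.1f}")
```

Because each stump places its split at a slightly different point, the averaged prediction is smoother than that of any single stump. Reducing the variance of individual trees in this way is the main motivation behind ensemble methods.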

## CH17A Predicting firm exit: probability and classification

Many companies have relationships with other companies, as suppliers or clients. Whether those other companies stay in business in the future or exit is an important question for them. How can we use data on many companies across the years to predict the probability of their exit? And can we classify them into two groups, companies that are likely to exit and companies that are likely to stay in business?

This case study uses the `bisnode-firms` dataset, a panel dataset with a large number of companies from specific industries in a European country, to illustrate probability prediction and classification. After a good deal of feature engineering, we estimate several logit models to predict the probability of firm exit and compare their performance by 5-fold cross-validation, choose the best model to describe how well it predicts the probabilities on a holdout sample, and use the predicted probabilities and two alternative methods for classification. This case study illustrates how to carry out **probability predictions**, how to evaluate their goodness of fit and other aspects of predictive performance, how to find an **optimal classification threshold** with the help of a loss function using a formula or model-dependent cross-validation, and how to use **expected loss** and the **confusion table** to evaluate classifications. It illustrates how the **ROC curve** visualizes the trade-offs of false positive and false negative decisions at various classification thresholds, and how to use **random forest for probability prediction and classification**. The case study is also a good example of potential issues with **external validity** of predictions and how we may detect the possibility of such issues in the original data.

**Code**: **Stata** or **R-prep**, **R-study** or **Python** or ALL.
**Data**: bisnode-firms.
**Graphs**: .png or .eps
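
The link between a loss function, the classification threshold, and the confusion table can be sketched as follows. The example assumes the standard formula for well-calibrated probabilities, threshold = loss(FP) / (loss(FP) + loss(FN)). The predicted probabilities, outcomes, losses, and function names are all hypothetical.

```python
def confusion_table(probs, actuals, threshold):
    """Count TP, FP, FN, TN when predicting exit for prob >= threshold."""
    table = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for prob, actual in zip(probs, actuals):
        predicted = prob >= threshold
        if predicted and actual:
            table["TP"] += 1
        elif predicted and not actual:
            table["FP"] += 1
        elif not predicted and actual:
            table["FN"] += 1
        else:
            table["TN"] += 1
    return table

def expected_loss(table, loss_fp, loss_fn):
    """Average loss per observation, given losses for the two error types."""
    n = sum(table.values())
    return (table["FP"] * loss_fp + table["FN"] * loss_fn) / n

def formula_threshold(loss_fp, loss_fn):
    """Threshold implied by the loss function, assuming calibrated probabilities."""
    return loss_fp / (loss_fp + loss_fn)

# Hypothetical predicted exit probabilities and actual outcomes (1 = exit).
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.60, 0.80]
exited = [0, 0, 0, 1, 0, 1, 1, 1]
loss_fp, loss_fn = 1, 3  # assume a false negative is three times as costly

threshold = formula_threshold(loss_fp, loss_fn)
table = confusion_table(probs, exited, threshold)
loss_at_formula = expected_loss(table, loss_fp, loss_fn)
loss_at_half = expected_loss(confusion_table(probs, exited, 0.5), loss_fp, loss_fn)
print(threshold, table, loss_at_formula, loss_at_half)
```

In this toy example, the threshold implied by the loss function yields a lower expected loss than the conventional 0.5 cutoff, because it accepts more false positives to avoid the costlier false negatives.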

## CH18A Forecasting daily ticket sales for a swimming pool

How can we use transaction data to predict the daily volume of sales? In particular, how can we use sales terminal data on tickets sold at a swimming pool to predict the number of tickets sold on each day next year?

This case study uses the `swim-transactions` dataset with transaction-level data from all swimming pools for many years in Albuquerque, New Mexico, USA, and selects a single swimming pool. The case study illustrates long-term forecasts. We aggregate the data to daily frequency, discuss data issues and how to solve them, specify several regression models, and select the best by cross-validation. The case study illustrates the use of **transaction data** in predictive analytics, **cross-validation with time series data**, the use of **trend** and, especially, **seasonality** in making long-term predictions, and the use of the automated **Prophet** algorithm. It is an example of how evaluating predictions can detect problems that further data work and analysis may solve.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: swim-transactions.
**Graphs**: .png or .eps
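
As a small sketch of the first steps described above, the following aggregates transaction-level records to daily counts and computes a simple day-of-week seasonal average. The transaction dates are made up, and a real forecast would also model trend, months, holidays, and so on.

```python
from collections import Counter, defaultdict
from datetime import date

# Made-up transaction dates, one entry per ticket sold (illustration only).
transactions = (
    [date(2016, 7, 1)] * 2
    + [date(2016, 7, 2)] * 4
    + [date(2016, 7, 8)] * 4
    + [date(2016, 7, 9)] * 6
)

# Aggregate to daily frequency: number of tickets sold on each date.
daily = Counter(transactions)

# Group daily totals by day of the week (Monday = 0) and average them.
by_weekday = defaultdict(list)
for day, count in daily.items():
    by_weekday[day.weekday()].append(count)
weekday_mean = {wd: sum(counts) / len(counts) for wd, counts in by_weekday.items()}

for wd, mean in sorted(weekday_mean.items()):
    print(f"weekday {wd}: average tickets {mean:.1f}")
```

One data issue is visible even in this sketch: days with no transactions never appear in `daily`, so they would have to be added explicitly as zeros (or identified as closures) before modeling.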

## CH18B Forecasting a house price index

How can we use data on past home prices, and possibly other variables, to predict how home prices will change in a particular city in the next months?

This case study uses the `case-shiller-la` dataset with monthly observations on the Case-Shiller home price index for the city of Los Angeles, California, USA between 2000 and 2017. The dataset also contains monthly time series of the unemployment rate and employment rate. After exploratory data analysis we estimate various ARIMA time series models that use the price index, as well as VAR models that use the unemployment and employment rates as well, and we use appropriate cross-validation to select the best model. The case study illustrates how to make use of **serial correlation** to make short-term forecasts with the help of **ARIMA models**, how to use other variables and their forecasted values in a **vector autoregression (VAR)** model, and how to select the best model by **cross-validation with time series data** that preserves the serial correlation in the data.

**Code**: **Stata** or **R** or **Python** or ALL.
**Data**: case-shiller-la.
**Graphs**: .png or .eps
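
Cross-validation with time series data has to keep observations in their time order. One common scheme, sketched below, uses an expanding training window followed by a block of consecutive test observations. This is an illustrative design with a hypothetical function name, and the case study may set up its folds differently.

```python
def expanding_window_splits(n_obs, n_folds, test_size):
    """Yield (train_indices, test_indices) with training always before test.

    Each test block holds test_size consecutive observations; the training
    set is every observation that precedes the block.
    """
    first_test_start = n_obs - n_folds * test_size
    if first_test_start < 1:
        raise ValueError("not enough observations for the requested folds")
    for fold in range(n_folds):
        start = first_test_start + fold * test_size
        yield list(range(start)), list(range(start, start + test_size))

# Example: 24 monthly observations, 3 folds, 4 months in each test block.
splits = list(expanding_window_splits(n_obs=24, n_folds=3, test_size=4))
for train, test in splits:
    print(f"train 0..{train[-1]}, test {test[0]}..{test[-1]}")
```

Unlike randomly shuffled k-fold cross-validation, every test block here consists of consecutive periods that come after the training data, so the serial correlation within each block is preserved and no future information leaks into model fitting.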