Gabors Data Analysis

Data analysis is a process

Each of our 47 case studies starts with a relevant question and answers it at the end, using real-life data and applying the tools and methods covered in the particular chapter.

PART I: DATA EXPLORATION

CH01A Finding a good deal among hotels: data collection

Vienna, Austria is a popular tourist destination for business and leisure. From the hundreds of places that offer accommodation, we want to pick a hotel that is underpriced relative to its location and quality for a weekday in November 2017. Can we use data to help this decision? What kind of data would we need, and how could we get it?

This case study illustrates how to collect appropriate data from the web on multiple offers.

chapter slides code hotels-vienna

CH01B Comparing online and offline prices: data collection

Do online and offline prices of the same products tend to be the same? To answer that question, we need data on both the online and offline (in store) price of many products. Such data was collected as part of the Billion Prices Project (BPP), an umbrella of multiple projects that collect price data for various purposes using various methods.

This case study illustrates how to combine different data collection methods and what the challenges are with such data collection.

chapter slides code billion-prices

CH01C Management quality: data collection

How different are firms and other organizations in terms of their management practices? Is the quality of management related to how large the firms are? Is it affected by whether the owners are the company founders or their families? To answer these, and many related, questions, we need data on management quality. Such data was collected by the World Management Survey (WMS; https://worldmanagementsurvey.org/), an international research initiative to measure the differences in management practices across organizations and countries.

This case study illustrates how to collect data by surveys.

chapter slides code wms-management-survey

CH02A Finding a good deal among hotels: data preparation

Continuing with our search for a hotel that is underpriced relative to its location and quality in Vienna, we have scraped data from the web, and we’ve got a data table. But how should we start working with this data? In particular, how should we identify hotels, how should we make sure each hotel features only once in the data, and how should we select the variables we would consider for our future analysis?

This case study uses the hotels-vienna dataset to illustrate how to find problems with observations and variables. It illustrates the various types of variables. It shows how to create a tidy data table and how to deal with missing values and duplicates.

chapter slides code hotels-vienna

CH02B Displaying immunization rates across countries

Immunization against measles is an effective way to prevent the disease and may save the lives of children. But how do various countries fare in terms of their immunization rates? In particular, how should we structure and use data from many countries and many years to analyze immunization rates across countries and years?

This short case study illustrates how to store multi-dimensional data.

chapter slides code world-bank-immunization

CH02C Identifying successful football managers

The English Premier League (EPL) is the top football (soccer) division in England. Team managers, as coaches are known in football, arguably play a very important role in the success of their teams. How can we use two separate data tables on games and managers to identify the most successful football manager in the EPL?

This case study uses the football dataset that covers all games played in the EPL and data on managers, including which team they worked at and when.

chapter slides code football

CH03A Finding a good deal among hotels: data exploration

Further continuing our search for a good deal (a hotel in Vienna that is underpriced for its location and quality), we’ve got a clean data table and identified the variables we want to analyze. How should we start the analysis? In particular, how should we explore the most important variables, why should we do that, and what conclusions can we draw from such exploratory analysis?

This case study uses the hotels-vienna dataset to illustrate how to describe the distribution of variables and how to use the findings to identify potential problems in the data, such as extreme values.

chapter slides code hotels-vienna

CH03B Comparing hotel prices in Europe: Vienna vs London

How can we compare hotel markets across Europe and learn about the characteristics of hotel prices? Can we visualize two distributions on one graph? What descriptive statistics would best describe each distribution and their differences? Can we visualize descriptive statistics?

This case study uses the hotels-europe dataset and selects 3-4 star hotels in Vienna and London to compare the distribution of prices for a weekday in November 2017.

chapter slides code hotels-europe

CH03C Measuring home team advantage in football

Is there such a thing as home team advantage in professional football (soccer)? That is, do teams that play in their home stadium tend to perform better? And how should we measure better performance?

This case study uses the football dataset, with data on the games played in the English Premier League (EPL) during the 2016/17 season. The case study shows the use of exploratory data analysis to answer a substantive question and introduces guidelines to present statistics in a good table.

chapter slides code football

CH03D Distributions of body height and income

Are the distributions of body height and family income well approximated by theoretical distributions? Answering this question can help characterize their distributions and provide guidance for future analysis on how to use these variables.

In this very short case study, we examine survey data collected by the Health and Retirement Study in the U.S.A. in 2014 (height-income-distributions dataset). We show that the height of women aged 55-60 can be described by the normal distribution, whereas the income of their households is reasonably well characterized by the lognormal distribution.

chapter slides code height-income-distributions

CH03U1 Size distribution of Japanese cities

What is the size distribution of Japanese cities? Looking at cities with at least 150,000 inhabitants, it follows a power law.

chapter slides code height-income-distributions

CH04A Management quality and firm size: describing patterns of association

Are larger companies better managed? We want to explore the association between management quality and firm size in a particular country (Mexico). To answer this question we need to define the y and x variables in this comparison. In particular, we need to assess how the variables in the dataset correspond to the abstract concepts of management quality and firm size.

This case study uses the Mexican subsample of the World Management Survey dataset (wms-management-survey) from 2013.

chapter slides code wms-management-survey

CH05A What likelihood of loss to expect on a stock portfolio?

Can we find out the future likelihood of a large loss on a stock portfolio based on data from the past? We choose the S&P 500 stock market index as our investment portfolio, and we define a large loss as a drop of at least 5% in its value from one day to the next. We can easily calculate the proportion of such days in the data, but we are interested in future losses, not past ones. To answer our question we need to make generalizations from our data. Such generalizations are bound to bring uncertainty, and we would like to quantify that uncertainty, too.

This case study uses the sp500 dataset that covers day-to-day returns on the S&P 500 stock market index for 11 years to illustrate how we can generalize an estimated statistic from a particular dataset to the population, or general pattern, it represents, and beyond, to the general pattern we are interested in.

chapter slides code sp500
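
As a minimal sketch of the calculation described for CH05A, the share of large-loss days and a 95% confidence interval around it could be computed along these lines. The returns below are made-up numbers, not the sp500 data, and the function name `loss_share_ci` and the normal-approximation standard error are illustrative assumptions rather than the case study's own code.

```python
import math

def loss_share_ci(returns, threshold=-0.05, z=1.96):
    """Share of days with a return at or below `threshold`, plus a
    normal-approximation confidence interval (z=1.96 is roughly 95%)."""
    n = len(returns)
    share = sum(1 for r in returns if r <= threshold) / n
    se = math.sqrt(share * (1 - share) / n)  # standard error of a proportion
    return share, share - z * se, share + z * se

# Hypothetical daily returns as fractions: 13 large-loss days out of 2500.
returns = [-0.06] * 13 + [0.001] * 2487
share, ci_low, ci_high = loss_share_ci(returns)
print(f"share: {share:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```

The interval is what carries the generalization step: it describes the uncertainty around the share in the general pattern the data represents, not just the share observed in the past.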

CH06A Comparing online and offline prices: testing the difference

Do online and offline prices of the same products tend to be the same? Answering this question can help us make better purchase choices and understand the business practices of retailers, and it can inform whether we can use online data to approximate offline prices for policy analysis.

This case study uses the billion-prices dataset. We examine online and offline prices of retail products in the U.S. in 2015-16. The case study illustrates how to translate a more abstract question into an inquiry about a statistic (here the average difference).

chapter slides code billion-prices

CH06B Testing the likelihood of loss on a stock portfolio

Will our investment portfolio suffer a large loss with a higher chance than what we can accept? When we want to know what’s the likelihood of large future losses on our portfolio, we can use the confidence interval to quantify the uncertainty from estimating it from data on past returns. But we can ask a more pointed question, too: whether our stock portfolio will suffer large future losses more often than we can accept. To answer that question we need a different procedure: testing a hypothesis.

This case study uses the sp500 dataset that covers day-to-day returns for 11 years to illustrate how we can test whether a likelihood is greater or less than a specified value.

chapter slides code sp500
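
To make the idea of testing a likelihood against a specified value in CH06B concrete, here is a rough sketch of a one-sided test for a proportion. The 1% acceptable threshold, the counts, the direction of the hypotheses, and the function name are hypothetical choices for illustration. The test uses a normal approximation, which may differ from the procedure in the chapter.

```python
import math

def one_sided_proportion_test(events, n, p0):
    """z-test of H0: p <= p0 against HA: p > p0 (normal approximation)."""
    p_hat = events / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se0
    p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))  # P(Z >= z)
    return p_hat, z, p_value

# Hypothetical numbers: 13 large-loss days in 2500, acceptable share 1%.
p_hat, z, p_value = one_sided_proportion_test(events=13, n=2500, p0=0.01)
print(f"estimated share: {p_hat:.4f}, z: {z:.2f}, p-value: {p_value:.3f}")
```

A large p-value here would mean the data gives no evidence that large losses occur more often than the acceptable share.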

PART II: REGRESSION ANALYSIS

CH07A Finding a good deal among hotels with simple regression

How can we find the hotels that are underpriced relative to their distance from the city center? Continuing the previous case studies that resulted in a clean data table ready for analysis and explored the main variables, we need to uncover how hotel price is related to distance to the city center to know what price to expect at what distances. Then we can identify hotels that are the most underpriced compared to their expected price.

This case study uses the hotels-vienna dataset to illustrate regression analysis with one right-hand-side variable.

chapter slides code hotels-vienna
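
The logic of CH07A (estimate the expected price at each distance, then look for the most negative gap between actual and expected price) can be sketched as follows. The five hotels and their numbers are invented for illustration and do not come from the hotels-vienna dataset, and `fit_simple_ols` is a hypothetical helper written from the textbook least-squares formulas.

```python
def fit_simple_ols(x, y):
    """Intercept and slope of a least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical hotels: (name, distance to center in km, price).
hotels = [("A", 0.5, 150.0), ("B", 1.0, 120.0), ("C", 2.0, 110.0),
          ("D", 3.0, 60.0), ("E", 4.0, 80.0)]
distance = [d for _, d, _ in hotels]
price = [p for _, _, p in hotels]

intercept, slope = fit_simple_ols(distance, price)

# Residual = actual price minus expected price; most negative = best deal.
residuals = {name: p - (intercept + slope * d) for name, d, p in hotels}
most_underpriced = min(residuals, key=residuals.get)
print(most_underpriced, round(residuals[most_underpriced], 2))
```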

CH08A Finding a good deal among hotels with non-linear function

Continuing our search for the best hotel deals in Vienna, we would like to uncover the shape of the price-distance association to get at the best estimates of expected prices at various distances. But what’s the best way to compare prices? Should we compare their absolute values, or should we aim for a relative comparison, such as percent differences? And how can we do the latter in a regression using cross-sectional data?

This short case study again uses the hotels-vienna dataset to illustrate linear regression analysis with the use of logarithms.

chapter slides code hotels-vienna

CH08B How is life expectancy related to the average income of a country?

People tend to live longer in richer countries. How long people live is usually measured by life expectancy; how rich a country is tends to be captured by its yearly income, measured by GDP. But should we use total GDP or GDP per capita? And what’s the shape of the patterns of association? Is the same percent difference in income related to the same difference in how long people live among richer countries and poorer countries? Finding the shape of the association helps benchmark life expectancy among countries with similar levels of income and identify countries where people tend to live especially long or especially short lives for their income.

This case study uses the worldbank-lifeexpectancy dataset based on the World Development Indicators database available at the World Bank website. It examines cross-sectional data from a single year, 2017, for 182 countries.

chapter slides code worldbank-lifeexpectancy

CH08C Measurement error in hotel ratings

When we search for a good deal among hotels, we care about hotel quality as well as distance to the city center. Online price comparison websites collect customer ratings and publish the average of those ratings, which can serve as a measure of quality. But some averages are based on very few ratings while others are based on hundreds or thousands of ratings. Should we be concerned about ratings coming from very few customers? In particular, what are the consequences of that feature of the data on the results of regression analysis?

This short case study again uses the hotels-vienna dataset to illustrate the consequences of measurement error for regression analysis.

chapter slides code hotels-vienna

CH09A Estimating gender and age differences in earnings

Do women working in the same occupation tend to earn the same as men? And what are the differences in earnings by age? Understanding these differences may help students know what to expect when choosing a particular career.

This case study uses the cps-morg dataset, a cross-section based on the Current Population Survey (CPS) of the U.S. in 2014.

chapter slides code cps-morg

CH09B How stable is the hotel price–distance to center relationship?

We have uncovered the average price–distance association among hotels in a particular city on a particular date. How generalizable is this pattern to other dates, to other cities, and to other types of accommodations?

This case study uses the hotels-europe data from Vienna, Amsterdam and Barcelona.

chapter slides code hotels-europe

CH10A Understanding the gender difference in earnings

Women earn less, on average, than men with similar qualifications. How large is that difference among employees with a graduate degree? How does that difference vary with age? And how much of the difference do characteristics of the employers and family circumstances of the employees explain? Understanding the magnitude, patterns, and causes of gender differences in earnings is important from the viewpoint of social equity as well as efficient allocation of labor.

This short case study uses the cps-morg dataset to illustrate the use of multiple regression analysis to help understand the sources of differences between groups of observations.

chapter slides code cps-morg

CH10B Finding a good deal among hotels with multiple regression

We return to finding a good deal among hotels for the last time. We want to find the hotels that are underpriced for their quality and distance to the city center. To do so we first need to uncover expected prices at various levels of distance and quality in a way that reflects all important patterns in the data. Then we can look for hotels that are the most underpriced relative to their expected price.

This case study uses the hotels-vienna dataset to illustrate the use of multiple regression analysis for prediction within a sample and residual analysis.

chapter slides code hotels-vienna

CH11A Does smoking pose a health risk?

Are smokers less likely to remain healthy than non-smokers? How about former smokers who quit?

This case study uses the share-health data from the SHARE survey (Survey for Health, Aging and Retirement in Europe).

chapter slides code share-health

CH11B Are Australian weather forecasts well-calibrated?

Should we take an umbrella when the weather forecast predicts rain? In particular, how much should we trust the weather forecast when it predicts a certain likelihood of rain? For example, is it true that it rains on 20 percent of the days when it says the likelihood is 20 percent?

This short case study uses the australia-weather-forecast data covering 350 days in 2015/16 and looks at rain forecast and actual rain for the Northern Australian city of Darwin.

chapter slides code australia-weather-forecast
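
The calibration question in CH11B (does it rain on about 20 percent of the days with a 20 percent forecast?) can be illustrated with a small sketch that groups days by the stated forecast probability and compares it with the observed share of rainy days. The forecasts and outcomes below are invented, not taken from the australia-weather-forecast data, and the function name is a hypothetical one.

```python
from collections import defaultdict

def calibration_by_forecast(forecasts, outcomes):
    """Map each stated forecast probability to
    (observed share of rainy days, number of days)."""
    groups = defaultdict(list)
    for prob, rained in zip(forecasts, outcomes):
        groups[prob].append(rained)
    return {prob: (sum(days) / len(days), len(days))
            for prob, days in sorted(groups.items())}

# Hypothetical data: 10 days with a 20% forecast, 5 days with an 80% forecast.
forecasts = [0.2] * 10 + [0.8] * 5
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0]  # 1 = it rained

for prob, (observed, n_days) in calibration_by_forecast(forecasts, outcomes).items():
    print(f"forecast {prob:.0%}: observed {observed:.0%} over {n_days} days")
```

A forecast is well calibrated when the observed share in each group is close to the stated probability. With real data the groups may need to be wider bins of forecast values rather than exact values.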

CH12A Returns on a company stock and market returns

How do monthly returns on a company stock move together with monthly market returns? The strength of this association is a good measure of how risky the company stock is.

This case study uses the stocks-sp500 dataset covering 21 years of daily data of many company stocks, focusing on the Microsoft stock and the S&P 500 stock market index.

chapter slides code stocks-sp500

CH12B Electricity consumption and temperature

How does temperature affect residential electricity consumption? Answering this question can help plan electricity production and assess the potential effects of climate on electricity use.

This case study uses the arizona-electricity dataset that covers 17 years of monthly electricity consumption data from the state of Arizona in the USA and monthly temperature data from a weather station in its largest city, Phoenix.

chapter slides code arizona-electricity

PART III: PREDICTION

CH13A Predicting used car value with linear regressions

For how much can we expect to sell our used car? And what price could we expect if we waited a year or more? With appropriate data on similar used cars we can estimate various regression models to predict expected price as a function of its features. But how should we select the best regression model for prediction?

This case study uses the used-cars dataset with data from classified ads of used cars from various cities of the U.S.A. in 2018.

chapter slides code used-cars

CH14A Predicting used car value: log prices

Continuing with our example of predicting used car prices, how should we decide on whether to transform our target variable? In particular, we can specify regression models with log price instead of price as the target variable. How do we make predictions about price when the target variable is in logs, and how do we choose between models with log price versus price as the target variable?

This short case study uses the same used-cars dataset as case study 13A with used car data from several cities in the USA in 2018.

chapter slides code used-cars
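
One common way to turn a log-price prediction into a price prediction, as CH14A asks about, is to exponentiate the predicted log price and multiply by exp(sigma^2 / 2), where sigma is the standard deviation of the log-model residuals. The sketch below assumes normally distributed residuals, uses made-up numbers, and uses a hypothetical function name. It is not necessarily the exact procedure used in the case study.

```python
import math

def price_from_log_prediction(log_prediction, residual_std):
    """Convert a predicted log price into a predicted price, applying the
    exp(sigma^2 / 2) adjustment that holds if log-model errors are normal."""
    return math.exp(log_prediction) * math.exp(residual_std ** 2 / 2)

# Hypothetical values: predicted log price and residual standard deviation.
log_prediction = math.log(5000)
residual_std = 0.4

naive = math.exp(log_prediction)  # ignores the adjustment
adjusted = price_from_log_prediction(log_prediction, residual_std)
print(f"naive: {naive:.0f}, adjusted: {adjusted:.0f}")
```

The naive back-transformation tends to understate the expected price, which is why the adjustment matters when comparing models with log price and price as the target.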

CH14B Predicting AirBnB apartment prices: selecting a regression model

London, UK is a popular tourist destination for business and leisure. We want to predict the rental price of an apartment offered by AirBnB in Hackney, a London borough. The results of this prediction can help tourists choose an offer that is underpriced for its features, or help apartment owners decide what price they could expect if they rented out their apartment on AirBnB.

This case study uses the airbnb dataset that includes rental prices for one night in March 2017 in greater London, and selects a specific borough.

chapter slides code airbnb

CH15A Predicting used car value with regression trees

Further continuing with our example of predicting used car prices, is there a better method for prediction than regression? Ideally, such a method would be better than linear regression at capturing the most important nonlinear patterns and interactions between feature variables and arrive at better predictions. The regression tree promises to be such an alternative, but how does it compare to linear regression in an actual prediction?

This case study uses the used-cars dataset from 2018 and its combined Chicago and Los Angeles subsamples on a specific model to illustrate regression trees.

chapter slides code used-cars

CH16A Predicting apartment prices with random forest

Continuing with our question of how to predict AirBnB apartment prices in London, UK, we want to build the best model for prediction. In particular, we want to see how two different methods that combine many regression trees compare to each other, to the single regression tree, and to linear regressions.

We use the airbnb dataset that includes rental prices for one night in March 2017 from the area of Greater London.

chapter slides code airbnb

CH17A Predicting firm exit: probability and classification

Many companies have relationships with other companies, as suppliers or clients. Whether those other companies stay in business in the future or exit is an important question for them. How can we use data on many companies across the years to predict the probability of their exit? And can we classify them into two groups, companies that are likely to exit and companies that are likely to stay in business?

This case study uses the bisnode-firms dataset, a panel dataset with a large number of companies from specific industries in a European country, to illustrate probability prediction and classification.

chapter slides code bisnode-firms

CH18A Forecasting daily ticket sales for a swimming pool

How can we use transaction data to predict the daily volume of sales? In particular, how can we use sales terminal data on tickets sold for a swimming pool to predict the number of tickets sold on each day next year?

This case study uses the swim-transactions dataset with transaction-level data from all swimming pools for many years in Albuquerque, New Mexico, USA, and selects a single swimming pool.

chapter slides code swim-transactions

CH18B Forecasting a house price index

How can we use data on past home prices, and possibly other variables, to predict how home prices will change in a particular city in the next months?

This case study uses the case-shiller-la dataset with monthly observations on the Case-Shiller home price index for the city of Los Angeles, California, USA between 2000 and 2017.

chapter slides code case-shiller-la

CH19A Food and health

Does eating a lot of fruit and vegetables help people remain healthy? Can we use available data on people’s eating habits and health to uncover those effects? What are the most important problems with using such data to answer our question, and can we do anything about them?

This case study uses the food-health dataset, cross-sectional data collected on the health and eating habits of people as part of the National Health and Nutrition Examination Survey (NHANES, USA); we use data from years 2009-2013.

chapter slides code food-health

CH20A Working from home and employee performance

What is the effect of working from home on employee performance? How can we design an experiment that could measure this effect? Once the data is collected from the experiment, how should we assess its quality, estimate the effect, and evaluate the internal and external validity of the results?

This case study uses the working-from-home data, from an experiment that was carried out at a large travel agency in China.

chapter slides code working-from-home

CH20B Fine tuning social media advertising

There are many choices to make when designing an online advertisement, including text content and details of appearance. Having alternative versions of these details, how can we select the version that would yield the most return?

This case study describes an A/B test that we carried out on a social media platform.

chapter slides code ab-test-social-media

CH21A Founder/family ownership and quality of management

Many firms are owned by their founder or family members of their founder. Are such founder/family owned firms as well managed as other kinds of firms and, if there is a difference, how much of that is due to their ownership as opposed to something else? Can we uncover that effect using cross-sectional observational data on firms and their management practices?

This case study uses the wms-management-survey dataset that we introduced in case study 1C.

chapter slides code wms-management-survey

CH22A How does a merger between airlines affect prices?

When two companies merge, the new firm has more market power, and it may use that power to increase price or decrease quality. How can we measure the effect of a merger between two firms on the price they charge? How can we use panel data from many markets to uncover this effect?

This case study uses the US-airlines dataset that is based on 10 percent of all tickets sold on the U.S. market, collected and maintained by the U.S. Department of Transportation.

chapter slides code US-airlines

CH23A Import demand and industrial production

How does import demand of a large country affect the industrial production of a medium-sized open economy? With time series data on imports of the large receiving country and industrial production of the smaller country, we can estimate a time series regression to uncover the effect. But the typical time series we can use are not very long, leading to uncertain estimates with wide confidence intervals. How can we use comparable data from other, similar countries to get more precise estimates?

This case study uses the asia-industry dataset with monthly time series of imports to the USA and industrial production in several Asian countries.

chapter slides code asia-industry

CH23B Immunization against measles and saving children

Immunization against measles is an effective way to prevent the disease and may save the lives of children. How can we use data from many countries and several years with immunization and child mortality rates to uncover the effect of immunization on the survival chances of children?

This case study uses the world-bank-immunization dataset with data from the World Development Indicators data website maintained by the World Bank to look at countries’ annual immunization rate and GDP per capita.

chapter slides code world-bank-immunization

CH24 Estimating the impact of replacing football team managers

Success in team sports depends on many things, and the work of the coach, or manager, is likely one of them. When a team performs below expectations, replacing the manager is one of the options teams can consider. How can we use data on all games for several seasons from a professional football (soccer) league and their managers to show how team performance tends to change after a manager is replaced? And how can we use the same data to estimate the counterfactual: how the performance of low-performing teams would have changed if the manager hadn’t been replaced?

This case study uses the football dataset with all games of the English Premier League (EPL) in 11 seasons and who the team manager was at each game.

chapter slides code football
