Gabors Data Analysis

Data Chats Podcast

2023-01-07T00:00:00+00:00

A while back I recorded a podcast with Chris Richardson of Pragmatic Institute, a California-based data science education institute. It is out now: Data Science in Action: Insights from a Leading Textbook Author and Case Studies, Friday Jan 06, 2023.

Also on Apple, Spotify

Pragmatic Institute’s data podcast covers emerging and relevant topics in data science, data analytics, data engineering and pretty much all things data. They have some cool episodes with fellow book authors Charles Wheelan, Alberto Cairo and Cole Nussbaumer Knaflic, as well as data scientists from GoDaddy or Netflix.

Data Chats summary

“We also wanted to use the textbook to talk about important social issues, like, why do men make more than women? And so, one case study is about understanding the gender gap in the US and why is there a 15, 20% difference in earnings.” - Gábor Békés

In this episode of Data Chats, Chris Richardson interviews Gábor Békés. They discuss:

Detailed process of data analysis
The most important methods, tools, and skills for future data analysts
Real life case-studies included in the book
Implications of data analysis on public policy
Humility required for the data interpretation process

More details

I talk about

How we chose the topics to include.
Why we didn’t include code in the book.
Process of producing the code that is on the website.
Why we chose the case studies that we did.
- One of the cases is about wage differences between men and women.
- What happens when you fire your manager or coach mid-season? Does the team improve in the short run?
- Swimming pool tickets: predicting how many people will come on a given day.
- Predicting whether a firm will default in the next two years, based on data.
- A merger between two airlines in the US: figuring out the effect on consumers.
- Football (soccer) case, English Premier League, about changing coaches/managers mid-season.

Some other bits

We thought long and hard about the first sentence of the book.
Data is silent without the analysis (just as, without the theory, the effects are silent). You need not just the data, but also the context and an understanding of the environment of the problem.

What a student will be able to do after reading the book: 95% of what an analyst needs (as pointed out by Joshua Angrist)

How you start working with your data
How you find patterns
Have a prediction mindset: how do I predict outside my data?
Be able to analyze the causal effects of changes.
Understand that this is a longer journey.
What one or two things can I do better tomorrow?

Data Analysis for Business Analytics

2022-07-15T00:00:00+00:00

A complete Data Analysis package for Business Analytics

MS in Business Analytics, MS in Applied Data Science – a variety of names exist for a relatively new breed of Master’s programs. They typically offer a one-year intensive curriculum to build up skills in coding (#Rstats or #Python), engineering with #SQL, #cloud, #bigdata, #spark and many more, as well as management and data analysis.

Data Analysis for Business, Economics, and Policy

Data Analysis is a mix of statistics, econometrics, data science, machine learning with the aim of teaching practical skills that analysts may use in real life working in business. As instructor of Data Analysis and program director at CEU MS in Business Analytics for years, I have experimented quite a bit with the curriculum. The experience gave rise to a textbook co-authored with Gabor Kezdi and published by Cambridge University Press. It is currently used in 80+ programs around the world, including a dozen Analytics Master’s programs.

How can this Data Analysis for Business, Economics, and Policy textbook help instructors and students in Business Analytics or related programs? We believe it can help in several ways.

Four benefits

We think there are four key benefits of using the textbook in Business Analytics education: an approach that aims at highlighting the analytical process, curated content from exploration to machine learning and causal inference, focus on case studies and an ecosystem helping the learning process.

Data analysis is a process

First, it will help students understand that data analysis is a process. It starts with formulating a question and collecting appropriate data or assessing whether the available data can help answer the question. Then come the tedious but essential tasks of data cleaning and organizing that affect the results of the analysis as much as any other step in the process. Exploratory data analysis gives context to the eventual results and helps decide the details of the analytical method to be applied. The main analysis consists of choosing and implementing the method to answer the question, with potential robustness checks. Along the way, correct interpretation and effective presentation of the results are crucial. Carefully crafted data visualizations help summarize our findings and convey the key messages. The final task is to answer the original question, with potential qualifications and directions for future inquiries.

Four parts, four courses on data exploration, regressions, prediction and causal inference

Second, it will help you think about a course structure. In our view, a great way to teach the whole process is to organize it in four courses: data exploration, regression analysis, prediction with machine learning, and causal analysis.

Data Exploration: The first course starts with discussing (and possibly doing) data collection and thinking about data quality, followed by organizing and cleaning data, exploratory data analysis and data visualization. This first part must also include the key statistical knowledge needed for everything else: important distributions, statistical inference, correlation and hypothesis testing and some sampling methods. You may offer this as a bootcamp or a pre-session allowing more experienced students to exam out.
Regression analysis: A thorough introduction to regression analysis comes next, including probability models and time series regressions. It is vital for data analysis students to learn to design regression models and deeply understand their output. As part of the process, they may learn the role of selecting a functional form, dealing with uncertainty, as well as different types of dependent variables.
Prediction with Machine Learning: The third part covers predictive analytics and introduces cross-validation, LASSO, tree-based machine learning methods such as random forest, probability prediction, classification, and forecasting from time series data. This is where modern data science is presented in a prediction context focusing on the “do what works” approach. In this course, only a few methods are presented, but those are discussed in depth. Students may then take additional machine learning classes to master other models like deep learning.
Causal effect of interventions: The fourth part covers causal analysis, starting with the potential outcomes framework as well as DAGs (causal maps). Students discuss the role of randomized experiments as well as practical issues doing them offline and online. Statistical methods include difference-in-differences analysis, various panel data methods and the event study approach.

A case study focus

Third, our approach focuses on explaining how to carry out an actual data analysis project from beginning to end. To cover all the steps that are necessary to carry out an actual data analysis project, we lean on a set of fully developed case studies. While each case study focuses on the method discussed in the chapter, they illustrate all elements of the process from question through analysis to conclusion. Our case studies cover a wide range of topics, with a potential appeal to a wide range of students. They include consumer decision, economic and social policy, finance, business and management, health, and sport. Their regional coverage is also wider than usual: one third is from the U.S.A., one third is from Europe and the U.K., and one third is from other countries or includes all parts of the world, from Australia to Thailand.

The ecosystem

Fourth, there is an ecosystem built into and around the textbook, consisting of

online datasets,
online code in Stata, R, and Python
in-text practice questions (360)
in-text data exercises (120)
online reading recommendations
online suggested practice datasets
online suggestions to learn coding ideas in R, Stata and Python
complete coding course for Data Analysis in R (NEW)
For instructors, lecture slides at Cambridge UP website, also available in latex/overleaf

Why use it?

In summary, analytic process orientation, comprehensive coverage of key topics, a focus on case studies and a massive ecosystem (with data, code and even a coding course) make this package a great value for instructors and students in Business Analytics or applied data science programs.

The book is endorsed by two Nobel laureates (Angrist and Card), the dean of Yale School of Management (Charles) and many leading scholars. It is already adopted in over 80 courses globally, including Columbia, U Texas, U Michigan, Penn State, Pittsburgh, U of California, Simon Fraser, UCLondon, Cambridge Judge, Essex, ESSEC, Bocconi, Berlin, CEU Vienna, Budapest, Amsterdam, UC Dublin, UWA Perth, Kyoto.

You may buy the book: Amazon.com, or a great deal of global options
Request an examination copy from the Publisher (or DM me)

Contact us
Follow us on Twitter and Facebook

As noted earlier, data analysis is a process. Hopefully we can help you enjoy the journey.

Interpreting a coefficient in a simple OLS regression

2021-11-29T00:00:00+00:00

Interpreting univariate OLS coefficients

Precise interpretation of a simple univariate, cross-sectional regression is not easy.

There are many good (precise) ways to do it, some that are not perfect and some that are not good. So let us offer an example, and a variety of good, partially OK but problematic, and simply wrong answers. We also add some comments to explain why an answer is good or bad.

The question

You have a representative sample of 10,000 people in a country, aged 15-45. You are interested in the relationship between earnings (USD per year) and age. You run a simple linear regression estimated with OLS. Both $y$ and $x$ are in levels.

\[y^{E} = \hat\alpha + \hat \beta \times age\]

The estimated coefficients are: $\hat \alpha = 7000, \hat\beta = 400$. The task is to interpret both of these coefficients.

Good answers

Let us start with the constant (intercept) $\hat\alpha$

For people aged zero (when age=0), earnings is $7000, on average
For people aged zero (when age=0), the expected earning is $7000
- you may or may not add on average, “expected” includes it
The constant cannot be interpreted in this context (because newborns make no money)
Intercept of the regression line, in this case has no realistic meaning, no earnings at age = 0

Now let us look at the slope, $\hat\beta$

People who are one year older, earn $400 more, on average
People who are one year older tend to earn $400 more (you may or may not add on average, “expected” kind of includes it)
- … are expected to earn $400 more
- …tend to have higher earnings by $400
One additional year of age is associated with $400 higher earning, on average
One year age difference is associated with an average of $400 extra earnings
One additional year in age corresponds to an average of $400 extra earnings
Earnings of people who are one year older are (tend to be / are expected to be) on average $400 higher in the data
Comparing two people, the one who is one year older is expected to have (tends to have) $400 higher earnings
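
To see what these statements mean in practice, here is a minimal sketch in Python. It simulates a hypothetical sample (the noise level and the random seed are our assumptions, not part of the question), estimates the regression with the closed-form OLS formulas, and compares the expected earnings of people who are one year apart.

```python
import random

# Hypothetical data: 10,000 people aged 15-45. The underlying line mirrors the
# coefficients in the question; the noise (sd = 3000) is an assumption.
rng = random.Random(0)
age = [rng.randint(15, 45) for _ in range(10_000)]
earnings = [7000 + 400 * a + rng.gauss(0, 3000) for a in age]

# OLS for a simple regression: slope = Cov(x, y) / Var(x),
# intercept = mean(y) - slope * mean(x).
n = len(age)
mean_x = sum(age) / n
mean_y = sum(earnings) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, earnings))
var_x = sum((x - mean_x) ** 2 for x in age)
beta_hat = cov_xy / var_x
alpha_hat = mean_y - beta_hat * mean_x


def predicted_earnings(a):
    """Expected earnings for people of age a, according to the fitted line."""
    return alpha_hat + beta_hat * a


# Slope: difference in expected earnings between people one year apart.
gap = predicted_earnings(31) - predicted_earnings(30)
print(f"alpha_hat = {alpha_hat:.0f}, beta_hat = {beta_hat:.0f}")
print(f"31-year-olds vs 30-year-olds, on average: {gap:.0f} USD more")

# Intercept: the fitted line evaluated at age 0, far outside the sample range.
print(f"youngest age in the sample: {min(age)}")
```

The gap between any two adjacent ages equals the slope, which is why the good answers compare different people rather than describe a change over time. The intercept is the line extended to age zero, where the sample has no observations.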

Partial credit, not completely bad but problematic

For constant, $\hat\alpha$

Newborn/Zero aged people earn $7000 ( missed: on average )
Average values of earning without considering the age is $7000 ( we consider it, but it’s zero )
The person earns 7000usd/year at least, no matter what is his age ( true but only because beta is positive )
7000 is the minimum income that has to be given irrespective of the age ( true but only because beta is positive )

For slope, $\hat\beta$

One additional year in age corresponds to $400 higher earning ( missed: on average )
One extra year means (implies) 400 more ( suggests causality, and missed on average )
any extra age adds up 400 to earnings ( suggests causality, and missed on average )
People who are one year older will have $400 higher earnings, on average ( “will have”: the data is about the past, we don’t know what the future brings. Yes, will can mean “likely” but should be avoided )

Not good (bad)

For constant, $\hat\alpha$

The intercept is 7000 ( not interpretation )
Average earning is 7000 if age=15 ( not at the minimum age in the sample )
Average earning is 7000 ( no, 7000 is average earnings at age zero )

For slope, $\hat\beta$

for every unit change in age the change in earning on average is 400USD ( it’s about cross-section differences between people, not changes)
One year increase will get $400 increase in wage ( no time series or causality, no increase! )
Each year in the age increases earnings by $400 ( no time series or causality, no increase! )
The slope is 400 ( not interpretation )

A simplified notation for OLS regression

2021-07-11T00:00:00+00:00

A simplified regression notation

We introduced some new notation in the textbook, to make the formulae simpler and more focused. In particular, our formula for regressions is slightly different from the traditional formula. We think that it is good practice to write out the formula for each regression that is analyzed. For this reason, it is important to use a notation for the regression formula that is as simple as possible and focuses only on what we care about. Our notation is intuitive, but it’s slightly different from traditional practice. Let us explain our reasons.

Our approach starts with the definition of the regression: it is a model for the conditional mean. The formulaic definition of the simple linear regression is $E[y \mid x]= \alpha + \beta x$ .

The formulaic definition of a linear regression with three right-hand-side variables is

\[E[y \mid x_1, x_2, x_3]= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\]

The regression formula we use in the textbook is a simplified version of this formulaic definition. In particular, we have $y^E$ on the left-hand side instead of $E[y \mid ...]$. So $y^E$ is just a shorthand for the expected value of $y$ conditional on whatever is on the right-hand side of the regression.

Thus, the formula for the simple linear regression is $y^E = \alpha + \beta x$, and $y^E$ is the expected value of $y$ conditional on $x$. The formula for the linear regression with three right-hand-side variables is

\[y^E= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\]

and here $y^E$ is the expected value of $y$ conditional on $x_1$, $x_2$, and $x_3$. Having $y^E$ on the left-hand side makes notation much simpler than writing out the conditional expectation formula $E[y \mid ...]$, especially when we have many right-hand-side variables.

In contrast, the traditional regression formula has the variable $y$ itself on the left-hand side, and has an additional element, the error term. This different notation acknowledges the fact that the actual value of $y$ is equal to its expected value (defined by the model) plus a deviation from it. For example, the traditional formula for the linear regression with three right-hand-side variables is

\[y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e\]

Importantly, it does not imply that the model is about the conditional mean.
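
One way to connect the two notations (a short sketch in the symbols used above, not a formula quoted from the textbook) is to define the error term as the deviation of $y$ from its conditional mean:

\[e = y - y^E \quad \Rightarrow \quad y = y^E + e, \qquad E[e \mid x_1, x_2, x_3] = E[y \mid x_1, x_2, x_3] - y^E = 0\]

In the traditional formula, $E[e \mid x_1, x_2, x_3] = 0$ has to be stated as a separate assumption. With $y^E$ on the left-hand side, it is built into the notation.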

Our notation is simpler, because it has fewer elements. More importantly, our notation makes it explicit that the regression is a model for the conditional mean. It focuses on the data that analysts care about (the right-hand-side variables and their coefficients), without adding anything else. Especially for introductory courses, we believe, it’s a better way.

Note that there are reasons to go beyond this notation. With causal analysis, we may come back to the traditional model and have a term capturing unobserved heterogeneity (often denoted by $u$, not $e$). For instance, when introducing potential outcomes, it may be useful. For prediction, we may also have a model that explicitly includes a prediction error: the difference between the actual and predicted value for a given $y_i$.

But, well, for most uses of a regression model, we suggest building on the expected value as a starting point to push the idea of regression as a model of conditional mean, and using our shorthand to simplify notation.

On picking the Viridis color scheme

2020-10-18T00:00:00+00:00

The economist meets visual arts

About two years ago, we had to make a final decision about what the graphs in the textbook would look like. I am an economist, and well, I know about ten colors. I had no idea what a hue means. I did not really know what a color scheme was.

Then, two things happened. First, TV shows about interior design with their discussion of color schemes. My fav show has been BBC’s The Great Interior Design Challenge. This is when I first learnt that what I call blueish green is actually teal. And that there are sooo many different shades of a color. I heard about the idea of color palettes and schemes, too.

Second, we had a serious discussion with Gabor K, who is partially color blind (as are about 6% of the population), on what he may and may not see. Red and green, not an option. So we needed to decide on how we use colors, and we were set to pick two good looking, color-blind friendly colors. To get started, I talked to my dataviz mentor, Alberto Cairo, who showed me a great application called Color Oracle, which shows whether an image works for color blind folks.

Then, I visited my editors in Cambridge, and I also met technical people, who told me that there are no more 2 or 3-color books, it is either a single color or full color. So if we wanna have colors, we can have as many colors as we’d like. Awesome.

So the summary of all this is that we realized, we can have many colors, but not any colors. And we need a scheme. So we tried to pin down our needs:

For some graphs, we would need many colors not just a few
We want to use colors to represent values on a less-more (few-many) scale
Color blind folks should see them all right
Bonus: a black and white copy shall work, i.e. light and dark instead of some colors.

Viridis to the rescue

I searched around until I found a wonderful video on Viridis. This is a conference talk at SciPy2015 by Nathaniel Smith and Stéfan van der Walt. In less than 20 minutes, it explained how vision works, what could go wrong and why viridis is a solution. Viridis is a scheme first developed for Matplotlib in Python, but now available in R and Stata as well.

When we needed a set of pre-defined colors for scatterplots or line graphs, we picked 5 colors from the scheme (here with their HEX code)
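
As an illustration, a set of evenly spaced colors can be pulled from the scheme with Matplotlib. This is only a sketch: the number of colors follows the text, but the even spacing is our assumption, so the HEX codes printed here are not necessarily the ones used in the textbook.

```python
from matplotlib import cm, colors


def viridis_hex(n):
    """Return n evenly spaced viridis colors as HEX strings, dark to light."""
    positions = [i / (n - 1) for i in range(n)]
    return [colors.to_hex(cm.viridis(p)) for p in positions]


palette = viridis_hex(5)
for hex_code in palette:
    print(hex_code)
```

Because viridis runs from dark purple to light yellow with steadily increasing lightness, a palette picked this way keeps its order in a black and white copy.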

And, well, that is basically it. We then went on and had the textbook key colors be also based on Viridis. But that will be another post…

reference

The viridis color palette in R by Bob Rudis, Noam Ross and Simon Garnier
Matplotlib colormaps, Option “D”.

Goalkeepers, random play and multiple testing

2020-07-18T00:00:00+00:00

Random action may be a well thought out strategy. Sometimes you would want to act randomly to make your next action hard to predict. This is particularly important in an adversarial situation, where players must choose a strategy. In game theory, this idea of picking an action from a set randomly is called a mixed strategy. One such case is saving penalties.

How can we test if goalkeepers play mixed strategies? Well, we can test if they move randomly, and test randomness for a large set of goalkeepers.

Stats and the beautiful game

Testing the hypothesis of random action across goalkeepers is what Ignacio Palacios-Huerta carried out in his wonderful book, Beautiful Game Theory: How Soccer Can Help Economics.

In the very first chapter, the author looked at penalties from the point of view of randomness. Goalkeepers often decide to move before the ball is shot. Having collected data on where the goalkeeper moved, he tests if moves (such as left and up or right and down) are random. Random action is a mixed strategy. Of 40 keepers, two are found to have a non-random system of moving at the 5% level. This is exactly the number of falsely accepted $H_{A}$ we would expect based on chance: 5% of 40 is 2.

Thus, finding two keepers who seemingly do not follow random action does not prove that they do not play a mixed strategy. Instead it shows that, for multiple testing, we need a stricter significance level, maybe 1% or 0.05%. With such a strict level set for multiple testing, we could not reject random action for either goalkeeper.
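
The arithmetic behind this argument can be sketched in a few lines of Python. The sketch assumes that the 40 tests are independent, and it uses the Bonferroni correction as one common way to tighten the threshold; the book chapter may handle multiple testing differently.

```python
n_keepers = 40   # number of goalkeepers tested (from the text)
alpha = 0.05     # significance level of each individual test

# If every keeper truly moves at random, each test still rejects with
# probability alpha, so some rejections are expected purely by chance.
expected_false_rejections = n_keepers * alpha

# Probability of at least one false rejection (assumes independent tests).
prob_at_least_one = 1 - (1 - alpha) ** n_keepers

# Bonferroni correction: divide alpha by the number of tests.
bonferroni_alpha = alpha / n_keepers

print(f"expected false rejections: {expected_false_rejections:.1f}")
print(f"P(at least one false rejection): {prob_at_least_one:.3f}")
print(f"Bonferroni per-test level: {bonferroni_alpha:.5f}")
```

The Bonferroni level works out to 0.125%, which falls between the 1% and 0.05% mentioned above.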

Selection bias – a war story

2020-07-15T00:00:00+00:00

During the Second World War, in a secret Manhattan building, statisticians and mathematicians were recruited from across the U.S.A. to carry out data analysis in order to save lives and help the U.S.A. win the war. One of the brightest of them was Abraham Wald, an emigré from Nazi occupied Central Europe.

Wald in action

One of Abraham Wald’s tasks was to figure out where to put armor on fighter airplanes. A dataset was put together on airplanes from previous combats with bullet holes shown on various parts of the planes. For example, wings were hit in x% of the cases, vertical stabilizers in y%, etc. The data showed a very low percentage for engines: hardly any of the airplanes had their engines hit by bullets.

Examining this data, Wald came to the conclusion that the extra armor should be fitted to protect engines. His superiors were stupefied: engines were in fact the least affected by bullets. But Wald’s argument convinced them in the end. What was that argument?

Wald realized that the data did not represent all fighter airplanes that fought combats. It included only airplanes that returned from combat so that their bodies could be examined for bullet holes. But the most important observations would have been those that did not return from combat: those were the airplanes that could have benefited the most from the extra armor.

Engines of selection

The fact that the airplanes that returned had no bullet holes in their engines could have had two reasons: engines were not hit, or the airplanes whose engines were hit were missing from the data. As the engines were exposed to bullets and had a large enough surface, the first reason was unlikely. In contrast, the second reason was very likely: airplanes whose engines were hit were almost certainly lost, while many of the airplanes whose other parts were hit were able to come back. Therefore, although they were not in the data, there were airplanes whose engines were hit, and they suffered the most. Putting armor on the engines had the potential to save the most airplanes.

The essence of the issue here was sample selection: the sample of airplanes that returned from combat was not a representative sample of all airplanes that participated in the combats. Instead it was affected by severe selection bias: only surviving planes made it to the sample.
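
A small simulation can illustrate the mechanism. Everything numeric here is hypothetical (the list of parts, equal hit chances, and the loss probabilities are our assumptions, not Wald's data); the point is only that a part whose hits are usually fatal is underrepresented among the planes that return.

```python
import random

rng = random.Random(1)
parts = ["engine", "fuselage", "wings", "tail"]
# Hypothetical probabilities that a plane is lost after a hit to each part.
loss_prob = {"engine": 0.8, "fuselage": 0.1, "wings": 0.1, "tail": 0.1}

all_hits, returned_hits = [], []
for _ in range(100_000):
    part = rng.choice(parts)            # every part is equally likely to be hit
    all_hits.append(part)
    if rng.random() > loss_prob[part]:  # the plane survives and can be examined
        returned_hits.append(part)


def engine_share(hits):
    """Share of recorded hits that are engine hits."""
    return hits.count("engine") / len(hits)


print(f"engine hits among all planes:      {engine_share(all_hits):.1%}")
print(f"engine hits among returned planes: {engine_share(returned_hits):.1%}")
```

Under these assumptions, engines receive about a quarter of all hits but a much smaller share of the hits seen on returning planes.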

References

This is a well known story, told in several places. Source of the image: McGeddon from Wikipedia

Some history of sampling

2020-07-14T00:00:00+00:00

Statistical enumerations of land, people, and property have taken place in many of the better organized empires and states since Babylonian times. These all aimed to cover entire populations. The first recorded use of a sample to learn something about a population was John Graunt’s estimate of the population of London in 1662.

Early methods of sampling

Graunt found that in a subset of parishes with good records, 3 burials took place on average per year for 11 families, and average family size was about 8. He used these estimates together with an already known sum: the total number of burials in London. The result of his estimation was a total population of 384,000. Of course, as Graunt himself recognized, the merits of this exercise rest on the quality of all ingredients in the calculation (burials per family, persons per family, total number of burials). In modern language, the first two are questions of representation: were those parishes that provided the estimates representative of all parishes? There was no way to answer that question with the data Graunt had, and the sampling method he had to use (parishes with good data) leaves room for severe selection bias.
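
The calculation can be written out as a short Python sketch. The text reports the resulting population but not the total number of burials, so the sketch backs out the burial count implied by the three reported numbers rather than assuming a historical figure.

```python
def estimate_population(total_burials, burials_per_11_families=3, family_size=8):
    """Scale total burials up to families, then families up to people."""
    families = total_burials * 11 / burials_per_11_families
    return families * family_size


# Burial count implied by the reported estimate of 384,000 people.
implied_burials = 384_000 * 3 / (11 * 8)

print(f"implied burials per year: {implied_burials:.0f}")
print(f"estimated population: {estimate_population(implied_burials):,.0f}")
```

Each ingredient enters multiplicatively, so a 10% error in burials per family or in family size moves the population estimate by roughly 10% as well.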

The first publication of a sampling method that may be considered random was by Anders Kiaer, founder of Statistics Norway, in 1895. It described a data collection exercise using a sample that was selected using a complex and detailed rule, with a lot of stratification in the beginning but randomness in the end, to estimate statistics about the entire population. Most contemporary statisticians did not accept those estimates as they were not from the entire population. Kiaer also gave little detail on the random part of the selection; neither did he offer a systematic justification for why his method should work.

Fisher and Neyman

It took fifty years and many other statisticians, including Ronald A. Fisher and Jerzy Neyman, to develop a statistical theory that could be used to evaluate estimation from random samples. That theory convinced many statisticians that random samples may work after all, and many started to use relatively small random samples to estimate population statistics. The success of those examples, then, convinced users of statistics of the enormous value of random samples. Random samples proved to help learn orders of magnitude more about a population from the same budget than the previously used complete enumeration methods.

The initial distrust in random sampling is not that surprising. It is always better to have information on entire populations than samples. Perhaps one may think that we would need samples of a very large proportion, say, 50% or 20%, to get accurate estimates for the population – in which case the cost savings of sampling would not be that high. The fact that samples of a few thousand observations may give us a very good picture of entire populations is crucial for the practical value of random samples. It is also quite surprising.

There is something counterintuitive in the fact that it is the size of the sample that matters, not its proportion of the population. But it is true nevertheless. Leaving selection to a random rule may also run against our instincts. It amounts to giving up control over the process – and, as humans, we often think that we make better decisions when we are in control. In the end, it was the demonstrated success of random samples that showed our intuition wrong, and, together with the mathematical statistical theory, convinced most users of the merits of relatively small random samples.
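
The point can be checked with the standard error of a sample proportion. The sketch below is a hedged illustration: it assumes simple random sampling without replacement, and the population share and the sizes are hypothetical.

```python
import math


def standard_error(p, n, population_size):
    """Standard error of a sample proportion under simple random sampling
    without replacement (includes the finite population correction)."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc


p = 0.5  # hypothetical population share; 0.5 gives the largest standard error
for population_size in (1_000_000, 100_000_000):
    for n in (2_000, 20_000):
        se = standard_error(p, n, population_size)
        share = n / population_size
        print(f"N={population_size:>11,} n={n:>6,} "
              f"(sample is {share:.4%} of N): SE = {se:.4f}")
```

Holding the sample size fixed, a hundredfold increase in the population barely changes the standard error, while a tenfold increase in the sample size shrinks it by a factor of about three.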

Literary Digest debacle

One influential example was the 1936 poll conducted by the then popular magazine Literary Digest to predict whether the incumbent Democratic president Franklin D. Roosevelt or Alfred Landon, the Republican candidate, would win the presidential election. The Literary Digest asked 10 million people and received answers from over 2 million, by far the largest poll conducted to date. The poll predicted a landslide victory for Alfred Landon: 57% against Roosevelt’s 43%.

The actual results were 62% for Roosevelt versus 38% for Landon. The 2 million strong Literary Digest sample was not a random subsample of the entire electorate: the 10 million invitees were its own readers, augmented by a sample from the telephone directory (back then, only more affluent Americans had a telephone). Moreover, the 2 million that replied were not a random subsample of the 10 million.

At the same time, a substantially smaller but more representative poll, conducted by a small company just started by George Gallup, correctly predicted a victory for Roosevelt.

References

Peverill Squire: Why the 1936 Literary Digest poll failed, Public Opinion Quarterly, 1988, 52, 125-133.

Variants of random sampling

2020-07-13T00:00:00+00:00

There are many variations on random sampling that aim to further improve representation or reduce the costs of data collection.

3 Variants of random sampling

Stratified sampling improves on the representative nature of a sample. It starts by specifying large groups in the population, and it applies random sampling within each group.

Cluster sampling starts by specifying small groups, takes a random sample of those groups, and collects data from all observations within each selected group. It can greatly reduce data collection cost, for example for in-person interviews that require a lot of traveling. In exchange, cluster samples increase the uncertainty in the data because the number of clusters is a lot smaller than the number of people, thus the sample size is smaller from the point of view of representation.

Multi-stage sampling mitigates this problem by selecting a somewhat larger number of clusters in a first stage and taking a random sample of observations within each cluster instead of including all observations. Some data collection requirements present specific circumstances that may call for further variations on random sampling.
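
The three variants can be sketched in Python with the standard library only. The population below (regions, villages, people) and all sample sizes are hypothetical, chosen so that each design yields 100 observations.

```python
import random
from collections import defaultdict


def group_by(units, key):
    """Collect units into lists keyed by units[key]."""
    groups = defaultdict(list)
    for u in units:
        groups[u[key]].append(u)
    return groups


def stratified_sample(units, key, n_per_group, rng):
    """Random sample of n_per_group units within every (large) group."""
    sample = []
    for members in group_by(units, key).values():
        sample += rng.sample(members, n_per_group)
    return sample


def cluster_sample(units, key, n_clusters, rng):
    """Randomly pick n_clusters (small) groups and keep all of their units."""
    groups = group_by(units, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [u for c in chosen for u in groups[c]]


def multistage_sample(units, key, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then a random sample of units within each."""
    groups = group_by(units, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    sample = []
    for c in chosen:
        sample += rng.sample(groups[c], n_per_cluster)
    return sample


# Hypothetical population: 5 regions x 20 villages x 10 people = 1,000 people.
rng = random.Random(42)
people = [{"id": (r, v, p), "region": r, "village": (r, v)}
          for r in range(5) for v in range(20) for p in range(10)]

strat = stratified_sample(people, "region", 20, rng)      # 5 regions x 20
clust = cluster_sample(people, "village", 10, rng)        # 10 villages x 10
multi = multistage_sample(people, "village", 25, 4, rng)  # 25 villages x 4

for name, s in [("stratified", strat), ("cluster", clust), ("multi-stage", multi)]:
    villages = len({u["village"] for u in s})
    print(f"{name:>11}: {len(s)} people from {villages} villages")
```

With the same number of people, the cluster sample touches the fewest villages (cheapest to collect, least variety), while the multi-stage sample spreads the same effort over more clusters.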

One example for stratified sampling comes from the Comparing online and offline prices case study. People collecting data received instructions to sample a specified number of products (between 10 and 50) in each store, in a stratified manner (by department). To what extent the resulting samples ended up representing the population is an open question here. However, if the purpose of the analysis is comparing online and offline prices, it is the distribution of this difference that has to be similar in the sample and in the population of all products. As long as these distributions are similar, other differences are not that important.