Jekyll2022-01-20T17:26:45+00:00https://gabors-data-analysis.com/feed.xmlGabors Data AnalysisA comprehensive textbook on data analysis for business, applied economics and public policy students, that uses case studies with real-world data.Gábor Békés and Gábor KézdiInterpreting a coefficient in a simple OLS regression2021-11-29T00:00:00+00:002021-11-29T00:00:00+00:00https://gabors-data-analysis.com/posts/2021/11/ols-coeff-interpret<h2 id="interpreting-univariate-ols-coefficients">Interpreting univariate OLS coefficients</h2>
<p>Precise interpretation of a simple univariate, cross-sectional regression is not easy.</p>
<p>There are many good (precise) ways to do it, some that are not perfect, and some that are not good. So let us offer an example and a variety of answers: good ones, partially OK but problematic ones, and simply wrong ones. We also add comments to explain why an answer is good or bad.</p>
<h3 id="the-question">The question</h3>
<p>You have a representative sample of 10,000 people in a country, aged 15-45. You are interested in the relationship between earnings (USD per year) and age. You run a simple linear regression estimated with OLS. Both \(y\) and \(x\) are in levels.</p>
\[y^{E} = \hat\alpha + \hat \beta \times age\]
<p>The estimated coefficients are: \(\hat \alpha = 7000, \hat\beta = 400\). The task is to interpret both of these coefficients.</p>
<h3 id="good-answers">Good answers</h3>
<p>Let us start with the constant (intercept) \(\hat\alpha\)</p>
<ul>
<li><strong>For people aged zero (when age=0), earnings is $7000, on average</strong></li>
<li>For people aged zero (when age=0), the expected earning is $7000
<ul>
<li><em>you may or may not add on average, “expected” includes it</em></li>
</ul>
</li>
<li>For people aged zero (when age=0), earning is $7000, on average</li>
<li>The constant cannot be interpreted in this context (because newborns make no money)</li>
<li>Intercept of the regression line, in this case has no realistic meaning, no earnings at age = 0</li>
</ul>
<p>Now let us look at the slope, \(\hat\beta\)</p>
<ul>
<li><strong>People who are one year older, earn $400 more, on average</strong></li>
<li>People who are one year older tend to earn $400 more (you may or may not add on average, “expected” kind of includes it)
<ul>
<li>… are expected to earn $400 more</li>
<li>…tend to have higher earnings by $400</li>
</ul>
</li>
<li>One additional year of age is associated with $400 higher earning, on average</li>
<li>One year age difference is associated with an average of $400 extra earnings</li>
<li>One additional year in age corresponds to an average of $400 extra earnings</li>
<li>Earnings of people who are one year older are (tend to be / are expected to be) on average $400 higher in the data</li>
<li>Comparing two people, the one who is one year older is expected to have (tends to have) $400 higher earnings</li>
</ul>
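<p>The arithmetic behind these interpretations can be sketched in a few lines of Python. This is only an illustration: the function name <code class="language-plaintext highlighter-rouge">expected_earnings</code> and the ages 30 and 31 are our own choices, not part of the question.</p>

```python
# Estimated coefficients from the example above.
ALPHA_HAT = 7000  # intercept: expected earnings (USD/year) when age = 0
BETA_HAT = 400    # slope: difference in expected earnings per one year of age


def expected_earnings(age):
    """Return the regression's predicted average earnings at a given age."""
    return ALPHA_HAT + BETA_HAT * age


# Compare the expected earnings of two groups that differ by one year of age.
# The ages 30 and 31 are arbitrary illustrative values within the sample range.
difference = expected_earnings(31) - expected_earnings(30)
print(expected_earnings(30))  # 19000
print(difference)             # 400
print(expected_earnings(0))   # 7000, an extrapolation far outside ages 15-45
```

<p>The difference is $400 whichever pair of ages one year apart we pick, which is why the slope is read as a comparison between people and not as a change over time.</p>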
<h3 id="partial-credit-not-completely-bad-but-problematic">Partial credit, not completely bad but problematic</h3>
<p>For constant, \(\hat\alpha\)</p>
<ul>
<li>Newborn/Zero aged people earn $7000 ( missed: on average )</li>
<li>Average values of earning without considering the age is $7000 ( we consider it, but it’s zero )</li>
<li>The person earns 7000usd/year at least, no matter what is his age ( true but only because beta is positive )</li>
<li>7000 is the minimum income that has to be given irrespective of the age ( true but only because beta is positive )</li>
</ul>
<p>For slope, \(\hat\beta\)</p>
<ul>
<li>One additional year in age corresponds to $400 higher earning ( missed: on average )</li>
<li>One extra year means (implies) 400 more ( suggests causality, and missed on average )</li>
<li>any extra age adds up 400 to earnings ( suggests causality, and missed on average )</li>
<li>People who are one year older will have $400 higher earnings, on average ( “will have”: the data is about the past, we don’t know what the future brings. Yes, <em>will</em> can mean “likely” but should be avoided )</li>
</ul>
<h3 id="not-good-bad">Not good (bad)</h3>
<p>For constant, \(\hat\alpha\)</p>
<ul>
<li>The intercept is 7000 ( not interpretation )</li>
<li>Average earning is 7000 if age=15 ( not at the minimum age in the sample )</li>
<li>Average earning is 7000 ( no, 7000 is average earnings at age zero )</li>
</ul>
<p>For slope, \(\hat\beta\)</p>
<ul>
<li>for every unit change in age the change in earning on average is 400USD ( it’s about cross-section differences between people, not changes)</li>
<li>One year increase will get $400 increase in wage ( no time series or causality, no increase! )</li>
<li>Each year in the age increases earnings by $400 ( no time series or causality, no increase! )</li>
<li>The slope is 400 ( not interpretation )</li>
</ul>Gábor Békés and Gábor KézdiInterpreting univariate OLS coefficientsA simplified notation for OLS regression2021-07-11T00:00:00+00:002021-07-11T00:00:00+00:00https://gabors-data-analysis.com/posts/2021/07/ols-notation<h2 id="a-simplified-regression-notation">A simplified regression notation</h2>
<p>We introduced some new notation in the textbook, to make the formulae simpler and more focused. In particular, our formula for regressions is slightly different from the traditional formula. We think that it is good practice to write out the formula for each regression that is analyzed. For this reason, it is important to use a notation for the regression formula that is as simple as possible and focuses only on what we care about. Our notation is intuitive, but it’s slightly different from traditional practice. Let us explain our reasons.</p>
<p>Our approach starts with the definition of the regression: it is a model for the conditional mean. The formulaic definition of the simple linear regression is \(E[y \mid x]= \alpha + \beta x\) .</p>
<p>The formulaic definition of a linear regression with three right-hand-side variables is</p>
\[E[y \mid x_1, x_2, x_3]= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\]
<p>The regression formula we use in the textbook is a simplified version of this formulaic definition. In particular, we have \(y^E\) on the left-hand side instead of \(E[y \mid ...]\). So \(y^E\) is just a shorthand for the expected value of \(y\) conditional on whatever is on the right-hand side of the regression.</p>
<p>Thus, the formula for the simple linear regression is \(y^E = \alpha + \beta x\), and \(y^E\) is the expected value of \(y\) conditional on \(x\). The formula for the linear regression with three right-hand-side variables is</p>
\[y^E= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\]
<p>and here \(y^E\) is the expected value of \(y\) conditional on \(x_1\), \(x_2\), and \(x_3\). Having \(y^E\) on the left-hand side makes notation much simpler than writing out the conditional expectation formula \(E[y \mid ...]\), especially when we have many right-hand-side variables.</p>
<p>In contrast, the traditional regression formula has the variable \(y\) itself on the left-hand side, and has an additional element, the error term. This different notation acknowledges the fact that the actual value of \(y\) is equal to its expected value (defined by the model) plus a deviation from it. For example, the traditional formula for the linear regression with three right-hand-side variables is</p>
\[y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e\]
<p>Importantly, this formula does not imply that the model is about the conditional mean.</p>
<p>Our notation is simpler, because it has fewer elements. More importantly, our notation makes it explicit that the regression is a model for the conditional mean. It focuses on the data that analysts care about (the right-hand-side variables and their coefficients), without adding anything else. Especially for introductory courses, we believe, it’s a better way.</p>
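<p>To see how the two notations relate, here is a minimal simulation sketch. The coefficients, the three values of \(x\), and the normally distributed error are all hypothetical choices made for illustration. Data are generated in the traditional form, with an error term that averages zero at every \(x\), and the average of \(y\) at each value of \(x\) is then compared with \(\alpha + \beta x\).</p>

```python
import random
from statistics import mean

random.seed(0)

ALPHA, BETA = 2.0, 3.0  # hypothetical coefficients, chosen for illustration

# Traditional form: y = alpha + beta * x + e, with e drawn around zero.
data = []
for _ in range(20000):
    x = random.choice([0, 1, 2])
    e = random.gauss(0, 1)
    data.append((x, ALPHA + BETA * x + e))

# y^E form: the average of y among observations sharing the same x.
conditional_means = {
    x_value: mean(y for x, y in data if x == x_value) for x_value in (0, 1, 2)
}

# Print each x, the simulated conditional mean, and alpha + beta * x.
for x_value, y_bar in conditional_means.items():
    print(x_value, round(y_bar, 2), ALPHA + BETA * x_value)
```

<p>In this setup the simulated conditional means land close to \(\alpha + \beta x\), which is exactly what \(y^E\) stands for.</p>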
<p>Note that there are reasons to go beyond this notation. With causal analysis, we may come back to the traditional model and have a term capturing unobserved heterogeneity (often denoted by \(u\), not \(e\)). This may be useful, for instance, when introducing potential outcomes. For prediction, we may also have a model that explicitly includes a prediction error: the difference between the actual and predicted value for a given \(y_i\).</p>
<p>But, well, for most uses of a regression model, we suggest building on the expected value as a starting point to push the idea of regression as a model of conditional mean, and using our shorthand to simplify notation.</p>Gábor Békés and Gábor KézdiA simplified regression notationOn picking the Viridis color scheme2020-10-18T00:00:00+00:002020-10-18T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/10/viridis<h2 id="the-economist-meets-visual-arts">The economist meets visual arts</h2>
<p>About two years ago, we had to make a final decision about what the graphs in the textbook would look like. I am an economist, and well, I know about ten colors. I had no idea what a <em>hue</em> means. I did not really know what a color scheme was.</p>
<p>Then, two things happened. First, TV shows about interior design, with their discussion of color schemes. My fav show has been <a href="https://www.bbc.co.uk/programmes/b04nj4d5" target="_blank">BBC’s The Great Interior Design Challenge</a>. This is when I first learnt that what I call bluish green is actually <em>teal</em>. And that there are sooo many different shades of a color. I heard about the idea of color palettes and schemes, too.</p>
<p>Second, we had a serious discussion with Gabor K, who is partially color blind (as is about 6% of the population), on what he may and may not see. Red and green: not an option. So we needed to decide on how we use colors, and we were set to pick two good looking, color-blind friendly colors. To get started, I talked to my dataviz mentor, <a href="http://albertocairo.com/" target="_blank">Alberto Cairo</a>, who showed me a great application, called <a href="https://colororacle.org/" target="_blank">Color Oracle</a>, that will pinpoint if an image works for color blind folks.</p>
<p>Then, I visited my editors in Cambridge, and I also met technical people, who told me that there are no more 2 or 3-color books, it is either a single color or full color. So if we wanna have colors, we can have as many colors as we’d like. Awesome.</p>
<p>So the summary of all this is that we realized, we can have <em>many</em> colors, but not <em>any</em> colors. And we need a scheme. So we tried to pin down our needs:</p>
<ol>
<li>For some graphs, we would need many colors not just a few</li>
<li>We want to use colors to represent values on a less-more (few-many) scale</li>
<li>Color blind folks should see them all right</li>
<li>Bonus: a black and white copy shall work, ie light and dark instead of some colors.</li>
</ol>
<h2 id="viridis-to-the-rescue">Viridis to the rescue</h2>
<p>I searched around until I found a wonderful <a href="https://www.youtube.com/watch?v=xAoljeRJ3lU" target="_blank">video on Viridis</a>. This is a conference talk at SciPy2015 by <a href="https://bids.berkeley.edu/people/nathaniel-smith" target="_blank">Nathaniel Smith</a> and <a href="https://bids.berkeley.edu/people/st%C3%A9fan-van-der-walt" target="_blank">Stéfan van der Walt</a>. In less than 20 minutes, it explained how vision works, what could go wrong and why viridis is a solution. Viridis is a scheme first developed for Matplotlib in Python, but now available in R and Stata as well.</p>
<p><img src="/images/viridis.png" alt="Viridis" /></p>
<p>When we needed a set of pre-defined colors for scatterplots or line graphs, we picked 5 colors from the scheme (here with their HEX codes).</p>
<p><img src="/images/5-colors.png" alt="Viridis 5 colors" /></p>
<p>And, well, that is basically it. We then went on and had the textbook key colors be also based on Viridis. But that will be another post…</p>
<h1 id="reference">reference</h1>
<p><a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html" target="_blank">The viridis color palette in R</a> by Bob Rudis, Noam Ross and Simon Garnier
<a href="https://bids.github.io/colormap/" target="_blank">Matplotlib colormaps</a>, Option “D”.</p>Gábor Békés and Gábor KézdiThe economist meets visual artsGoalkeepers, random play and multiple testing2020-07-18T00:00:00+00:002020-07-18T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/multiple-test-soccer<p>Random action may be a well thought out strategy. Sometimes you would want to act randomly to make your next action hard to predict. This is particularly important when there is an adversarial situation, and players must choose a strategy. In game theory, this idea of picking an action from a set randomly is called a <a href="https://en.wikipedia.org/wiki/Strategy_(game_theory)">mixed strategy</a>. One such case is saving penalties.</p>
<p>How can we test if goalkeepers play mixed strategies? Well, we can test if they move randomly, and test randomness for a large set of goalkeepers.</p>
<h2 id="stats-and-the-beautiful-game">Stats and the beautiful game</h2>
<p>Testing the hypothesis of random action across goalkeepers is what Ignacio Palacios-Huerta carried out in his wonderful book, <a href="https://press.princeton.edu/books/paperback/9780691169255/beautiful-game-theory">Beautiful Game Theory: How Soccer Can Help Economics</a>.</p>
<p><img src="/images/messi_penalty_kick2.jpg" alt="image" style="float: left" /></p>
<p>In the very first chapter, the author looks at penalties from the point of view of randomness. Goalkeepers often decide where to move before the ball is shot. Having collected data on where each goalkeeper moved, he tests whether the moves (such as left and up, or right and down) are random. Random action is a mixed strategy. Of 40 keepers, two are found to move non-randomly at the 5% significance level. This is exactly the number of falsely accepted \(H_{A}\) we would expect based on chance: \(40 \times 0.05 = 2\).</p>
<p>Thus, finding two keepers who seemingly do not follow random action does not prove that they do not play a mixed strategy. Instead, it shows that with multiple testing we need a stricter significance level, maybe 1% or 0.05%. With such a strict level set for multiple testing, we could not reject random action for either goalkeeper.</p>
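<p>The multiple testing arithmetic can be sketched as follows. The numbers of tests and the 5% level come from the text. The independence of the 40 tests is an assumption, and the Bonferroni correction is just one common way to pick a stricter level, not necessarily the one used in the book.</p>

```python
N_TESTS = 40   # number of goalkeepers tested, from the text
ALPHA = 0.05   # significance level of each individual test

# If every keeper truly moves at random, each test still rejects with
# probability ALPHA, so this many rejections are expected by chance alone.
expected_false_rejections = N_TESTS * ALPHA

# Probability of at least one false rejection across all tests,
# assuming the tests are independent.
prob_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

# Bonferroni correction: a per-test level that keeps the chance of any
# false rejection at or below ALPHA.
bonferroni_level = ALPHA / N_TESTS

print(expected_false_rejections)
print(round(prob_at_least_one, 2))
print(bonferroni_level)
```

<p>With 40 tests, at least one false rejection is very likely at the 5% level, while the Bonferroni level of 0.125% sits between the 1% and 0.05% levels mentioned above.</p>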
<h1 id="reference">reference</h1>
<p><a href="https://commons.wikimedia.org/wiki/File:FWC_2018_-_Group_D_-_ARG_v_ISL_-_Messi_penalty_kick.jpg">Image</a></p>Gábor Békés and Gábor KézdiRandom action may be a well thought out strategy. Sometimes you would want to act randomly to make your next action hard to predict. This is particularly important when there is an adversarial situation, and players must choose a strategy. In game theory, this idea of picking an action from a set randomly is called a mixed strategy. One such case is saving penalties.Selection bias – a war story2020-07-15T00:00:00+00:002020-07-15T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/selection-war<p>During the Second World War, in a secret Manhattan building, statisticians and mathematicians were recruited from across the U.S.A. to carry out data analysis in order to save lives and help the U.S.A. win the war. One of the brightest of them was <a href="https://en.wikipedia.org/wiki/Abraham_Wald">Abraham Wald</a>, an emigré from Nazi occupied Central Europe.</p>
<h2 id="wald-im-action">Wald in action</h2>
<p>One of Abraham Wald’s tasks was to figure out where to put armor on fighter airplanes. A dataset was put together on airplanes from previous combats with bullet holes shown on various parts of the planes. For example, wings were hit in x% of the cases, vertical stabilizers in y%, etc. The data showed a very low percentage for engines: hardly any of the airplanes had their engines hit by bullets.</p>
<p><img src="/images/survivorship-bias.png" alt="image" style="float: left" /></p>
<p>Examining this data, Wald came to the conclusion that the extra armor should be fitted to protect engines. His superiors were stupefied: engines were in fact the least affected by bullets. But Wald’s argument convinced them in the end. What was that argument?</p>
<p>Wald realized that the data did not represent all fighter airplanes that fought combats. It included only airplanes that returned from combat so that their bodies could be examined for bullet holes. But the most important observations would have been those that did not return from combat: those were the airplanes that could have benefited the most from the extra armor.</p>
<h2 id="engines-of-selection">Engines of selection</h2>
<p>The fact that the airplanes that returned had no bullet holes in their engines could have had two reasons: engines were not hit, or the airplanes whose engines were hit were missing from the data. As the engines were exposed to bullets and had a large enough surface, the first reason was unlikely. In contrast, the second reason was very likely: airplanes whose engines were hit were almost certainly lost, while many of the airplanes whose other parts were hit were able to come back. Therefore, although they were not in the data, there were airplanes whose engines were hit, and they suffered the most. Putting armor on the engines had the potential to save the most airplanes.</p>
<p>The essence of the issue here was <strong>sample selection</strong>: the sample of airplanes that returned from combat was not a representative sample of all airplanes that participated in the combats. Instead it was affected by severe selection bias: only surviving planes made it to the sample.</p>
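<p>A small simulation can illustrate the mechanism. Everything in it is hypothetical: the four sections, one hit per plane spread evenly across sections, and the return probabilities are made up for illustration and are not taken from Wald’s data.</p>

```python
import random

random.seed(1)

# All numbers below are hypothetical, chosen only to illustrate the mechanism.
SECTIONS = ["engine", "wings", "fuselage", "stabilizer"]
RETURN_PROB = {"engine": 0.1, "wings": 0.9, "fuselage": 0.9, "stabilizer": 0.9}

all_hits = {s: 0 for s in SECTIONS}       # hits on every plane that flew
returned_hits = {s: 0 for s in SECTIONS}  # hits seen on planes that came back

for _ in range(10000):
    section = random.choice(SECTIONS)  # each plane is hit once, uniformly
    all_hits[section] += 1
    if random.random() < RETURN_PROB[section]:
        returned_hits[section] += 1

engine_share_all = all_hits["engine"] / sum(all_hits.values())
engine_share_returned = returned_hits["engine"] / sum(returned_hits.values())

print(round(engine_share_all, 2))       # close to one quarter by construction
print(round(engine_share_returned, 2))  # much lower among returning planes
```

<p>Engines are hit as often as any other section in this setup, yet they look almost untouched in the sample of returning planes. That gap is the selection bias.</p>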
<h2 id="references">References</h2>
<p>This is a well known story, told in several places. Source of the image: McGeddon from <a href="https://en.wikipedia.org/wiki/Survivorship_bias">Wikipedia
</a></p>Gábor Békés and Gábor KézdiDuring the Second World War, in a secret Manhattan building, statisticians and mathematicians were recruited from across the U.S.A. to carry out data analysis in order to save lives and help the U.S.A. win the war. One of the brightest of them was Abraham Wald, an emigré from Nazi occupied Central Europe.Some history of sampling2020-07-14T00:00:00+00:002020-07-14T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/history-sampling<p>Statistical enumerations of land, people, and property have taken place in many of the better organized empires and states since Babylonian times. These all aimed to cover entire populations. The first recorded use of a sample to learn something about a population was John Graunt’s estimate of the population of London in 1662.</p>
<h2 id="early-methods-of-sampling">Early methods of sampling</h2>
<p>Graunt found that in a subset of parishes with good records, 3 burials took place on average per year for 11 families, and average family size was about 8. He used these estimates together with an already known sum: the total number of burials in London. The result of his estimation was a total population of 384,000. Of course, as <a href="https://www.britannica.com/biography/John-Graunt">Graunt</a> himself recognized, the merits of this exercise rest on the quality of all ingredients in the calculation (burials per family, persons per family, total number of burials). In modern language, the first two are questions of representation: were those parishes that provided the estimates representative of all parishes? There was no way to answer that question with the data Graunt had, and the sampling method he had to use (parishes with good data) leaves room for severe selection bias.</p>
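<p>The scaling logic of this estimate can be written out as a short sketch. The text does not report the total number of burials, so the code does not assume one. It only backs out the total that the three reported figures and the 384,000 result imply. The function and variable names are ours.</p>

```python
from fractions import Fraction

BURIALS_PER_YEAR = 3           # burials per year in the sampled parishes...
FAMILIES = 11                  # ...observed among this many families
FAMILY_SIZE = 8                # average persons per family
POPULATION_ESTIMATE = 384_000  # the estimate reported in the text

# People "represented" by one yearly burial in the sampled parishes.
persons_per_burial = Fraction(FAMILIES * FAMILY_SIZE, BURIALS_PER_YEAR)


def estimate_population(total_burials):
    """Scale the city-wide burial count up to a population estimate."""
    return total_burials * persons_per_burial


# Back out the burial total implied by the reported estimate.
implied_total_burials = POPULATION_ESTIMATE / persons_per_burial

print(float(persons_per_burial))
print(round(implied_total_burials))
```

<p>Each burial stands for about 29 people, so any error in the burial rate or the family size from the sampled parishes is multiplied across the whole city.</p>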
<p>The first publication of a sampling method that may be considered random was by Anders Kiaer, founder of Statistics Norway, in 1895. It described a data collection exercise using a sample that was selected using a complex and detailed rule, with a lot of stratification in the beginning but randomness in the end, to estimate statistics about the entire population. Most contemporary statisticians did not accept those estimates as they were not from the entire population. Kiaer also gave little detail on the random part of the selection; neither did he offer a systematic justification for why his method should work.</p>
<h2 id="fisher-and-neyman">Fisher and Neyman</h2>
<p>It took fifty years and many other statisticians, including Ronald A. Fisher and Jerzy Neyman, to develop a statistical theory that could be used to evaluate estimation from random samples. That theory convinced many statisticians that random samples may work after all, and many started to use relatively small random samples to estimate population statistics. The success of those examples, then, convinced users of statistics of the enormous value of random samples. Random samples proved to help learn orders of magnitude more about a population from the same budget than the previously used complete enumeration methods.</p>
<p><img src="/images/neymanfisher.jpg" alt="image" style="float: center" /></p>
<p>The initial distrust in random sampling is not that surprising. It is always better to have information on entire populations than samples. Perhaps one may think that we would need samples of a very large proportion, say, 50% or 20%, to get accurate estimates for the population – in which case the cost savings of sampling would not be that high. The fact that samples of a few thousand observations may give us a very good picture of entire populations is crucial for the practical value of random samples. It is also quite surprising.</p>
<p>There is something counterintuitive in the fact that it is the size of the sample that matters, not its proportion of the population. But it is true nevertheless. Leaving selection to a random rule may also run against our instincts. It amounts to giving up control over the process – and, as humans, we often think that we make better decisions when we are in control. In the end, it was the demonstrated success of random samples that showed our intuition wrong, and, together with the mathematical statistical theory, convinced most users of the merits of relatively small random samples.</p>
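<p>A quick calculation makes the point concrete. The sketch below uses the textbook formula for the standard error of a sample proportion under simple random sampling without replacement, including the finite population correction. The population and sample sizes are hypothetical.</p>

```python
import math


def standard_error(n, population_size, p=0.5):
    """Standard error of a sample proportion under simple random sampling
    without replacement, including the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return math.sqrt(p * (1 - p) / n) * fpc


# Same sample size, very different population sizes (hypothetical values).
se_small_country = standard_error(n=2000, population_size=1_000_000)
se_large_country = standard_error(n=2000, population_size=100_000_000)

# Same sampling proportion (0.2%) in the large country needs far more people.
se_same_proportion = standard_error(n=200_000, population_size=100_000_000)

for se in (se_small_country, se_large_country, se_same_proportion):
    print(round(se, 4))
```

<p>The first two standard errors are practically identical even though one population is a hundred times larger. Keeping the proportion fixed instead buys far more precision than most surveys need, at a hundred times the sample size.</p>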
<h2 id="literary-digest-debacle">Literary Digest debacle</h2>
<p>One influential example was the 1936 poll conducted by the then popular magazine Literary Digest to predict whether the incumbent Democratic president Franklin D. Roosevelt or Alfred Landon, the Republican candidate, would win the presidential election. The Literary Digest asked 10 million people and received answers from over 2 million, by far the largest poll conducted to date. The poll predicted a landslide victory for Alfred Landon: 57% against Roosevelt’s 43%.</p>
<p>The actual results were 62% for Roosevelt versus 38% for Landon. The 2 million strong Literary Digest sample was not a random subsample of the entire electorate: the 10 million invitees were its own readers, augmented by a sample from the telephone directory (back then, only more affluent Americans had a telephone). Moreover, the 2 million that replied were not a random subsample of the 10 million.</p>
<p>At the same time, a substantially smaller but more representative sample conducted by a small company just started by <a href="https://en.wikipedia.org/wiki/George_Gallup">George Gallup</a> correctly predicted a victory for Roosevelt.</p>
<h2 id="references">References</h2>
<p>Peverill Squire: Why the 1936 Literary Digest poll failed, <em>Public Opinion Quarterly, 1988, 52, 125–133</em>. <a href="https://www.jstor.org/stable/2749114?seq=1#page_scan_tab_contents">LINK</a></p>Gábor Békés and Gábor KézdiStatistical enumerations of land, people, and property have taken place in many of the better organized empires and states since Babylonian times. These all aimed to cover entire populations. The first recorded use of a sample to learn something about a population was John Graunt’s estimate of the population of London in 1662.Variants of random sampling2020-07-13T00:00:00+00:002020-07-13T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/sampling-variants<p>There are many variations on random sampling that aim to further improve representation or reduce the costs of data collection.</p>
<h2 id="3-variants-of-random-sampling">3 Variants of random sampling</h2>
<p><strong>Stratified sampling</strong> improves on the representative nature of a sample. It starts by specifying large groups in the population, and it applies random sampling within each group.</p>
<p><strong>Cluster sampling</strong> starts by specifying small groups, takes a random sample of those groups, and collects data from all observations within each selected group. It can greatly reduce data collection cost, for example for in-person interviews that require a lot of traveling. In exchange, cluster samples increase the uncertainty in the data because the number of clusters is a lot smaller than the number of people, thus the sample size is smaller from the point of view of representation.</p>
<p><strong>Multi-stage sampling</strong> mitigates this problem by selecting a somewhat larger number of clusters in a first stage and taking a random sample of observations within each cluster instead of including all observations. Some data collection requirements present specific circumstances that may call for further variations on random sampling.</p>
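<p>The three variants can be sketched side by side in Python. The population here is hypothetical (20 villages of 50 people, split into two regions), and the function names and sample sizes are illustrative choices.</p>

```python
import random

random.seed(42)

# Hypothetical population: 20 villages (clusters) of 50 people each,
# with every person also belonging to one of two regions (strata).
population = [
    {"id": v * 50 + i, "village": v, "region": "north" if v < 10 else "south"}
    for v in range(20)
    for i in range(50)
]


def stratified_sample(people, per_stratum):
    """Draw a random sample of fixed size within each region."""
    sample = []
    for region in ("north", "south"):
        members = [p for p in people if p["region"] == region]
        sample.extend(random.sample(members, per_stratum))
    return sample


def cluster_sample(people, n_clusters):
    """Pick villages at random and keep everyone in the chosen villages."""
    chosen = set(random.sample(range(20), n_clusters))
    return [p for p in people if p["village"] in chosen]


def multistage_sample(people, n_clusters, per_cluster):
    """Pick more villages, then a random subsample within each of them."""
    chosen = random.sample(range(20), n_clusters)
    sample = []
    for village in chosen:
        members = [p for p in people if p["village"] == village]
        sample.extend(random.sample(members, per_cluster))
    return sample


strat = stratified_sample(population, per_stratum=50)
clust = cluster_sample(population, n_clusters=2)
multi = multistage_sample(population, n_clusters=10, per_cluster=10)
print(len(strat), len(clust), len(multi))  # 100 100 100
```

<p>All three samples contain 100 people, but the cluster sample covers only 2 villages while the multi-stage sample covers 10, which is the trade-off between travel cost and representation described above.</p>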
<p>One example for stratified sampling comes from the <code class="language-plaintext highlighter-rouge">Comparing online and offline prices</code> case study. People collecting data received instructions to sample a specified number of products (between 10 and 50) in each store, in a stratified manner (by department). To what extent the resulting samples ended up representing the population is an open question here. However, if the purpose of the analysis is to compare online and offline prices, it is this difference that has to be similar in the sample and in the population of all products. As long as the distribution of this difference is similar in the two, other differences are not that important.</p>Gábor Békés and Gábor KézdiThere are many variations on random sampling that aim to further improve representation or reduce the costs of data collection.