Jekyll2021-07-16T13:25:17+00:00https://gabors-data-analysis.com/feed.xmlGabors Data AnalysisA comprehensive textbook on data analysis for business, applied economics and public policy students, that uses case studies with real-world data.Gábor Békés and Gábor KézdiA simplified notation for OLS regression2021-07-11T00:00:00+00:002021-07-11T00:00:00+00:00https://gabors-data-analysis.com/posts/2021/07/ols-notation<h2 id="a-simplified-regression-notation">A simplified regression notation</h2>
<p>We introduced some new notation in the textbook to make the formulae simpler and more focused. In particular, our formula for regressions is slightly different from the traditional formula. We think that it is good practice to write out the formula for each regression that is analyzed. For this reason, it is important to use a notation for the regression formula that is as simple as possible and focuses only on what we care about. Our notation is intuitive, but it’s slightly different from traditional practice. Let us explain our reasons.</p>
<p>Our approach starts with the definition of the regression: it is a model for the conditional mean. The formulaic definition of the simple linear regression is \(E[y \mid x]= \alpha + \beta x\).</p>
<p>The formulaic definition of a linear regression with three right-hand-side variables is</p>
\[E[y \mid x_1, x_2, x_3]= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\]
<p>The regression formula we use in the textbook is a simplified version of this formulaic definition. In particular, we have \(y^E\) on the left-hand side instead of \(E[y \mid ...]\). So \(y^E\) is just a shorthand for the expected value of \(y\) conditional on whatever is on the right-hand side of the regression.</p>
<p>Thus, the formula for the simple linear regression is \(y^E = \alpha + \beta x\), and \(y^E\) is the expected value of \(y\) conditional on \(x\). The formula for the linear regression with three right-hand-side variables is</p>
\[y^E= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\]
<p>and here \(y^E\) is the expected value of \(y\) conditional on \(x_1\), \(x_2\), and \(x_3\). Having \(y^E\) on the left-hand side makes notation much simpler than writing out the conditional expectation formula \(E[y \mid ...]\), especially when we have many right-hand-side variables.</p>
<p>In contrast, the traditional regression formula has the variable \(y\) itself on the left-hand side, and has an additional element, the error term. This different notation acknowledges the fact that the actual value of \(y\) is equal to its expected value (defined by the model) plus a deviation from it. For example, the traditional formula for the linear regression with three right-hand-side variables is</p>
\[y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + e\]
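<p>Importantly, this formula does not by itself imply that the model is about the conditional mean: for that, we need the additional assumption that \(E[e \mid x_1, x_2, x_3] = 0\).</p>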
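<p>To see how the two notations relate, here is a minimal sketch on made-up data (the numbers and variable names are hypothetical, not from the textbook). It fits a simple linear regression by OLS, computes \(y^E\), and recovers the error term of the traditional notation as \(e = y - y^E\). With a binary \(x\), \(y^E\) is simply the mean of \(y\) within each group.</p>

```python
from statistics import mean

# Hypothetical data: x is a binary variable, y is some outcome.
x = [0, 0, 0, 1, 1, 1]
y = [2.0, 3.0, 4.0, 5.0, 7.0, 9.0]

# OLS coefficients of the simple linear regression y^E = alpha + beta * x.
x_bar, y_bar = mean(x), mean(y)
beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
alpha = y_bar - beta * x_bar

# y^E: the expected value of y conditional on x, according to the regression.
y_e = [alpha + beta * xi for xi in x]

# e: the error term of the traditional notation, y = y^E + e.
e = [yi - yei for yi, yei in zip(y, y_e)]

# With a binary x, y^E equals the mean of y within each group.
mean_y_x0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean_y_x1 = mean(yi for xi, yi in zip(x, y) if xi == 1)

print(f"alpha = {alpha:.1f}, beta = {beta:.1f}")
print(f"mean of y when x=0: {mean_y_x0:.1f}, when x=1: {mean_y_x1:.1f}")
print(f"sum of errors: {sum(e):.1f}")
```

<p>Here \(\alpha\) is the mean of \(y\) when \(x=0\), and \(\alpha + \beta\) is the mean of \(y\) when \(x=1\), which is what the \(y^E\) notation puts front and center.</p>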
<p>Our notation is simpler, because it has fewer elements. More importantly, our notation makes it explicit that the regression is a model for the conditional mean. It focuses on the data that analysts care about (the right-hand-side variables and their coefficients), without adding anything else. Especially for introductory courses, we believe, it’s a better way.</p>
<p>Note that there are reasons to go beyond this notation. With causal analysis, we may come back to the traditional model and have a term capturing unobserved heterogeneity (often denoted by \(u\), not \(e\)). This may be useful, for instance, when introducing potential outcomes. For prediction, we may also want a model with an explicit prediction error: the difference between the actual and the predicted value for a given \(y_i\).</p>
<p>But, well, for most uses of a regression model, we suggest building on the expected value as a starting point to push the idea of regression as a model of conditional mean, and using our shorthand to simplify notation.</p>Gábor Békés and Gábor KézdiA simplified regression notationOn picking the Viridis color scheme2020-10-18T00:00:00+00:002020-10-18T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/10/viridis<h2 id="the-economist-meets-visual-arts">The economist meets visual arts</h2>
<p>About two years ago, we had to make a final decision about how the graphs in the textbook would look. I am an economist, and well, I know about ten colors. I had no idea what a <em>hue</em> means. I did not really know what a color scheme was.</p>
<p>Then, two things happened. First, I started watching TV shows about interior design, with their discussions of color schemes. My fav show has been <a href="https://www.bbc.co.uk/programmes/b04nj4d5" target="_blank">BBC’s The Great Interior Design Challenge</a>. This is when I first learnt that what I call bluish green is actually <em>teal</em>. And that there are sooo many different shades of a color. I heard about the idea of color palettes and schemes, too.</p>
<p>Second, we had a serious discussion with Gabor K, who is partially color blind (as is about 6% of the population), about what he may and may not see. Red and green: not an option. So we needed to decide on how we use colors, and we were set to pick two good looking, color-blind friendly colors. To get started, I talked to my dataviz mentor, <a href="http://albertocairo.com/" target="_blank">Alberto Cairo</a>, who showed me a great application called <a href="https://colororacle.org/" target="_blank">Color Oracle</a> that shows whether an image works for color blind folks.</p>
<p>Then, I visited my editors in Cambridge, and I also met technical people, who told me that there are no more 2- or 3-color books: it is either a single color or full color. So if we wanna have colors, we can have as many colors as we’d like. Awesome.</p>
<p>So the summary of all this is that we realized we can have <em>many</em> colors, but not <em>any</em> colors. And we need a scheme. So we tried to pin down our needs:</p>
<ol>
<li>For some graphs, we would need many colors, not just a few.</li>
<li>We want to use colors to represent values on a less-more (few-many) scale.</li>
<li>Color blind folks should see them all right.</li>
<li>Bonus: a black and white copy should work, i.e., light and dark shades instead of distinct colors.</li>
</ol>
<h2 id="viridis-to-the-rescue">Viridis to the rescue</h2>
<p>I searched around until I found a wonderful <a href="https://www.youtube.com/watch?v=xAoljeRJ3lU" target="_blank">video on Viridis</a>. This is a conference talk at SciPy 2015 by <a href="https://bids.berkeley.edu/people/nathaniel-smith" target="_blank">Nathaniel Smith</a> and <a href="https://bids.berkeley.edu/people/st%C3%A9fan-van-der-walt" target="_blank">Stéfan van der Walt</a>. In less than 20 minutes, it explains how vision works, what could go wrong, and why viridis is a solution. Viridis is a scheme first developed for Matplotlib in Python, but now available in R and Stata as well.</p>
<p><img src="/images/viridis.png" alt="Viridis" /></p>
<p>When we needed a set of pre-defined colors for scatterplots or line graphs, we picked 5 colors from the scheme (here with their HEX codes):</p>
<p><img src="/images/5-colors.png" alt="Viridis 5 colors" /></p>
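<p>The black and white requirement can be checked with a few lines of code. The sketch below is only an illustration: the five HEX codes are an assumption (five evenly spaced viridis colors, as returned by <code class="language-plaintext highlighter-rouge">viridis(5)</code> in the R viridis package) and may not be the exact five used in the textbook. It computes the relative luminance of each color with the standard sRGB formula and checks that the colors go from dark to light.</p>

```python
def relative_luminance(hex_code):
    """Relative luminance (0 = black, 1 = white) of an sRGB HEX color."""
    channels = [int(hex_code[i:i + 2], 16) / 255 for i in (1, 3, 5)]
    # Undo the sRGB gamma encoding before weighting the channels.
    linear = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


# Assumed colors: five evenly spaced points on the viridis scale.
five_colors = ["#440154", "#3B528B", "#21908C", "#5DC863", "#FDE725"]

luminances = [relative_luminance(c) for c in five_colors]

# True if each color is lighter than the one before it.
is_monotone = all(a < b for a, b in zip(luminances, luminances[1:]))

for color, lum in zip(five_colors, luminances):
    print(f"{color}: luminance {lum:.2f}")
print("dark-to-light ordering:", is_monotone)
```

<p>A steadily increasing luminance is what makes a grayscale copy readable: darker still means less, lighter still means more.</p>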
<p>And, well, that is basically it. We then went on and had the textbook key colors be also based on Viridis. But that will be another post…</p>
<h1 id="reference">Reference</h1>
<p><a href="https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html" target="_blank">The viridis color palette in R</a> by Bob Rudis, Noam Ross and Simon Garnier<br />
<a href="https://bids.github.io/colormap/" target="_blank">Matplotlib colormaps</a>, Option “D”.</p>Gábor Békés and Gábor KézdiThe economist meets visual artsGoalkeepers, random play and multiple testing2020-07-18T00:00:00+00:002020-07-18T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/multiple-test-soccer<p>Random action may be a well thought out strategy. Sometimes you would want to act randomly to make your next action hard to predict. This is particularly important when there is an adversarial situation, and players must choose a strategy. In game theory, this idea of picking an action from a set randomly is called a <a href="https://en.wikipedia.org/wiki/Strategy_(game_theory)">mixed strategy</a>. One such case is saving penalties.</p>
<p>How can we test if goalkeepers play mixed strategies? Well, we can test if they move randomly, and test randomness for a large set of goalkeepers.</p>
<h2 id="stats-and-the-beautiful-game">Stats and the beautiful game</h2>
<p>Testing the hypothesis of random action across goalkeepers is what Ignacio Palacios-Huerta carried out in his wonderful book, <a href="https://press.princeton.edu/books/paperback/9780691169255/beautiful-game-theory">Beautiful Game Theory: How Soccer Can Help Economics</a>.</p>
<p><img src="/images/messi_penalty_kick2.jpg" alt="image" style="float: left" /></p>
<p>In the very first chapter, the author looks at penalties from the point of view of randomness. Goalkeepers often decide to move before the ball is shot. Having collected data on where goalkeepers moved, he tests whether the moves (such as left and up, or right and down) are random. Random action is consistent with a mixed strategy. Of 40 keepers, two are found to have a non-random pattern of moves at the 5% significance level. This is exactly the number of false rejections of the null (falsely accepted \(H_A\)) that we would expect by chance: 5% of 40 is 2.</p>
<p>Thus, finding two keepers who do not seem to act randomly does not prove that they do not play a mixed strategy. Instead, it shows that with multiple testing we need a stricter significance level, maybe 1% or 0.05%. With such a strict level set for multiple testing, we could not reject random action for either of the two goalkeepers.</p>
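<p>The arithmetic behind this argument is short enough to spell out. The sketch below assumes that the 40 tests are independent and that every keeper in fact moves randomly. The Bonferroni correction shown at the end is one common way to pick a stricter level; it is our illustration, not necessarily the approach taken in the book.</p>

```python
n_tests = 40   # number of goalkeepers tested
alpha = 0.05   # significance level of each individual test

# If every keeper moves randomly, each test still rejects with probability alpha.
expected_false_rejections = n_tests * alpha

# Probability that at least one of the 40 tests rejects purely by chance
# (assumes the tests are independent).
prob_at_least_one = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: divide the level by the number of tests.
bonferroni_alpha = alpha / n_tests

print(f"expected false rejections: {expected_false_rejections:.1f}")
print(f"chance of at least one false rejection: {prob_at_least_one:.2f}")
print(f"Bonferroni-corrected level: {bonferroni_alpha:.5f}")
```

<p>With 40 tests, seeing at least one rejection is close to a sure thing even if everybody plays randomly, which is why the individual threshold has to be much stricter than 5%.</p>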
<h1 id="reference">Reference</h1>
<p><a href="https://commons.wikimedia.org/wiki/File:FWC_2018_-_Group_D_-_ARG_v_ISL_-_Messi_penalty_kick.jpg">Image</a></p>Gábor Békés and Gábor KézdiRandom action may be a well thought out strategy. Sometimes you would want to act randomly to make your next action hard to predict. This is particularly important when there is an adversarial situation, and players must choose a strategy. In game theory, this idea of picking an action from a set randomly is called a mixed strategy. One such case is saving penalties.Selection bias – a war story2020-07-15T00:00:00+00:002020-07-15T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/selection-war<p>During the Second World War, in a secret Manhattan building, statisticians and mathematicians were recruited from across the U.S.A. to carry out data analysis in order to save lives and help the U.S.A. win the war. One of the brightest of them was <a href="https://en.wikipedia.org/wiki/Abraham_Wald">Abraham Wald</a>, an émigré from Nazi-occupied Central Europe.</p>
<h2 id="wald-im-action">Wald in action</h2>
<p>One of Abraham Wald’s tasks was to figure out where to put armor on fighter airplanes. A dataset was put together on airplanes from previous combats with bullet holes shown on various parts of the planes. For example, wings were hit in x% of the cases, vertical stabilizers in y%, etc. The data showed a very low percentage for engines: hardly any of the airplanes had their engines hit by bullets.</p>
<p><img src="/images/survivorship-bias.png" alt="image" style="float: left" /></p>
<p>Examining this data, Wald came to the conclusion that the extra armor should be fitted to protect engines. His superiors were stupefied: engines were in fact the least affected by bullets. But Wald’s argument convinced them in the end. What was that argument?</p>
<p>Wald realized that the data did not represent all fighter airplanes that flew in combat. It included only airplanes that returned from combat so that their bodies could be examined for bullet holes. But the most important observations would have been those that did not return from combat: those were the airplanes that could have benefited the most from the extra armor.</p>
<h2 id="engines-of-selection">Engines of selection</h2>
<p>The fact that the airplanes that returned had no bullet holes in their engines could have had two reasons: engines were not hit, or the airplanes whose engines were hit were missing from the data. As the engines were exposed to bullets and had a large enough surface, the first reason was unlikely. In contrast, the second reason was very likely: airplanes whose engines were hit were almost certainly lost, while many of the airplanes whose other parts were hit were able to come back. Therefore, although they were not in the data, there were airplanes whose engines were hit, and they suffered the most. Putting armor on the engines had the potential to save the most airplanes.</p>
<p>The essence of the issue here was <strong>sample selection</strong>: the sample of airplanes that returned from combat was not a representative sample of all airplanes that participated in the combats. Instead it was affected by severe selection bias: only surviving planes made it to the sample.</p>
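<p>A small numerical sketch shows how strong this selection can be. All numbers below are made up for illustration: we assume that each of four parts is equally likely to be hit, and that an airplane hit in the engine returns with probability 0.2 while an airplane hit elsewhere returns with probability 0.9.</p>

```python
# Assumed share of hits by part among ALL airplanes that flew (hypothetical).
hit_share = {"engine": 0.25, "fuselage": 0.25, "wings": 0.25, "tail": 0.25}

# Assumed probability of returning to base, given where the plane was hit.
return_prob = {"engine": 0.2, "fuselage": 0.9, "wings": 0.9, "tail": 0.9}

# Share of all airplanes that were hit in a given part AND returned.
returned = {part: hit_share[part] * return_prob[part] for part in hit_share}
total_returned = sum(returned.values())

# Share of hits by part among RETURNED airplanes: what the dataset shows.
observed_share = {part: returned[part] / total_returned for part in returned}

for part in hit_share:
    print(f"{part:9s} all planes: {hit_share[part]:.0%}   "
          f"returned planes: {observed_share[part]:.0%}")
```

<p>Under these assumptions, engines account for a quarter of all hits but only about 7% of the hits seen on returned airplanes. The low share in the data reflects who came back, not where bullets landed.</p>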
<h2 id="references">References</h2>
<p>This is a well known story, told in several places. Source of the image: McGeddon from <a href="https://en.wikipedia.org/wiki/Survivorship_bias">Wikipedia
</a></p>Gábor Békés and Gábor KézdiDuring the Second World War, in a secret Manhattan building, statisticians and mathematicians were recruited from across the U.S.A. to carry out data analysis in order to save lives and help the U.S.A. win the war. One of the brightest of them was Abraham Wald, an emigré from Nazi occupied Central Europe.Some history of sampling2020-07-14T00:00:00+00:002020-07-14T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/history-sampling<p>Statistical enumerations of land, people, and property have taken place in many of the better organized empires and states since Babylonian times. These all aimed to cover entire populations. The first recorded use of a sample to learn something about a population was John Graunt’s estimate of the population of London in 1662.</p>
<h2 id="early-methods-of-sampling">Early methods of sampling</h2>
<p>Graunt found that in a subset of parishes with good records, 3 burials took place on average per year for 11 families, and average family size was about 8. He used these estimates together with an already known sum: the total number of burials in London. The result of his estimation was a total population of 384,000. Of course, as <a href="https://www.britannica.com/biography/John-Graunt">Graunt</a> himself recognized, the merits of this exercise rest on the quality of all ingredients in the calculation (burials per family, persons per family, total number of burials). In modern language, the first two are questions of representation: were those parishes that provided the estimates representative of all parishes? There was no way to answer that question with the data Graunt had, and the sampling method he had to use (parishes with good data) leaves room for severe selection bias.</p>
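<p>Graunt’s calculation can be retraced in a few lines. The total number of burials is not given in the text above; the figure of about 13,000 per year used below is an assumption, chosen because it is consistent with the reported result once the number of families is rounded to the nearest thousand.</p>

```python
burials_per_family = 3 / 11   # from the parishes with good records
family_size = 8               # average persons per family
total_burials = 13_000        # ASSUMED yearly burials in London (not in the text)

# Scale up: how many families would produce that many burials?
families = total_burials / burials_per_family

# Round to the nearest thousand families, then multiply by family size.
families_rounded = round(families, -3)
population = families_rounded * family_size

print(f"families: {families:,.0f} (rounded: {families_rounded:,.0f})")
print(f"estimated population: {population:,.0f}")
```

<p>The structure of the estimate is what matters here: two ratios from a sample of parishes, scaled up by one known total.</p>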
<p>The first publication of a sampling method that may be considered random was by Anders Kiaer, founder of Statistics Norway, in 1895. It described a data collection exercise using a sample that was selected using a complex and detailed rule, with a lot of stratification in the beginning but randomness in the end, to estimate statistics about the entire population. Most contemporary statisticians did not accept those estimates as they were not from the entire population. Kiaer also gave little detail on the random part of the selection; neither did he offer a systematic justification for why his method should work.</p>
<h2 id="fisher-and-neyman">Fisher and Neyman</h2>
<p>It took fifty years and many other statisticians, including Ronald A. Fisher and Jerzy Neyman, to develop a statistical theory that could be used to evaluate estimation from random samples. That theory convinced many statisticians that random samples may work after all, and many started to use relatively small random samples to estimate population statistics. The success of those examples, then, convinced users of statistics of the enormous value of random samples. Random samples proved to help learn orders of magnitude more about a population from the same budget than the previously used complete enumeration methods.</p>
<p>The initial distrust in random sampling is not that surprising. It is always better to have information on entire populations than samples. Perhaps one may think that we would need samples of a very large proportion, say, 50% or 20%, to get accurate estimates for the population – in which case the cost savings of sampling would not be that high. The fact that samples of a few thousand observations may give us a very good picture of entire populations is crucial for the practical value of random samples. It is also quite surprising. There is something counterintuitive in the fact that it is the size of the sample that matters, not its proportion of the population. But it is true nevertheless. Leaving selection to a random rule may also run against our instincts. It amounts to giving up control over the process – and, as humans, we often think that we make better decisions when we are in control. In the end, it was the demonstrated success of random samples that showed our intuition wrong, and, together with the mathematical statistical theory, convinced most users of the merits of relatively small random samples.</p>
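<p>The claim that sample size matters, not the sampled proportion, can be illustrated with the textbook formula for the standard error of an estimated share from a simple random sample drawn without replacement, including the finite population correction. The population sizes and the share of 0.5 below are arbitrary choices for illustration.</p>

```python
import math


def se_proportion(p, n, population_size):
    """Standard error of an estimated share p from a simple random sample
    of size n drawn without replacement from a finite population."""
    srs = math.sqrt(p * (1 - p) / n)
    # Finite population correction: close to 1 unless n is a large share.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return srs * fpc


p = 0.5
populations = (100_000, 10_000_000, 300_000_000)

# Same sample size, very different population sizes.
se_by_population = {N: se_proportion(p, 2_000, N) for N in populations}

# Same population, sample ten times smaller.
se_small_sample = se_proportion(p, 200, 10_000_000)

for N, se in se_by_population.items():
    print(f"n = 2,000, N = {N:>11,}: SE = {se:.4f}")
print(f"n =   200, N =  10,000,000: SE = {se_small_sample:.4f}")
```

<p>A sample of 2,000 gives a standard error of about 1.1 percentage points whether the population has a hundred thousand or three hundred million members. Cutting the sample to 200 roughly triples it.</p>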
<h2 id="literary-digest-debacle">Literary Digest debacle</h2>
<p>One influential example was the 1936 poll conducted by the then popular magazine Literary Digest to predict whether the incumbent Democratic president Franklin D. Roosevelt or Alfred Landon, the Republican candidate, would win the presidential election. The Literary Digest asked 10 million people and received answers from over 2 million, by far the largest poll conducted to date. The poll predicted a landslide victory for Alfred Landon: 57% against Roosevelt’s 43%. The actual results were 62% for Roosevelt versus 38% for Landon. The 2-million-strong Literary Digest sample was not a random subsample of the entire electorate: the 10 million invitees were its own readers, augmented by a sample from the telephone directory (back then, only more affluent Americans had a telephone). Moreover, the 2 million that replied were not a random subsample of the 10 million. At the same time, a substantially smaller but more representative sample conducted by George Gallup correctly predicted a victory for Roosevelt.</p>
<h2 id="references">References</h2>
<p>Peverill Squire: Why the 1936 Literary Digest poll failed, <em>Public Opinion Quarterly, 1988, 52, 125–133</em>. <a href="https://www.jstor.org/stable/2749114?seq=1#page_scan_tab_contents">LINK</a></p>Gábor Békés and Gábor KézdiStatistical enumerations of land, people, and property have taken place in many of the better organized empires and states since Babylonian times. These all aimed to cover entire populations. The first recorded use of a sample to learn something about a population was John Graunt’s estimate of the population of London in 1662.Comparing data from two different sources2020-07-14T00:00:00+00:002020-07-14T00:00:00+00:00https://gabors-data-analysis.com/posts/2021/05/compare-sources<p>Suppose you have data, but with many missing observations. You find a new data source that overlaps but has the potential for new observations or new values. How best to compare them?</p>
<h2 id="comparing-firm-info">Comparing firm info</h2>
<p>We were using a particular firm-level dataset. It has several problems: some firms are missing, and there are many missing values, too.
We have just found some new balance sheet information on firms. So we have two sources, call them <strong>A</strong> and <strong>B</strong>. They overlap, but their union would enrich the data we currently have.</p>
<p>Take a variable x, say, sales turnover, from both sources, for a given year, say 2015.</p>
<p>Define the ratio xs = sales(A) / sales(B) for firms with non-missing and non-zero values in both sources, i.e., those that really overlap.</p>
<p>Create disjoint bins for xs, and a table with two columns:</p>
<ol>
<li>Count of obs in bin</li>
<li>Mean of xs in the bin.</li>
</ol>
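<p>A minimal sketch of this comparison, using only the Python standard library. The firm identifiers and sales figures are made up, and the convention that each bin includes its lower bound and excludes its upper bound is our assumption; the bin edges themselves are the ones listed in this post.</p>

```python
from statistics import mean

# Hypothetical sales in 2015 by firm id; None marks a missing value.
sales_a = {"f1": 100.0, "f2": 220.0, "f3": None, "f4": 50.0, "f5": 0.0, "f6": 90.0}
sales_b = {"f1": 100.0, "f2": 200.0, "f3": 80.0, "f4": 120.0, "f5": 40.0, "f7": 10.0,
           "f6": 30.0}

# (label, lower bound included, upper bound excluded)
BINS = [
    ("below 0.5", 0.0, 0.5),
    ("0.5-0.8", 0.5, 0.8),
    ("0.8-0.952", 0.8, 0.952),
    ("0.952-1.05 but not 1", 0.952, 1.05),
    ("1.05-1.25", 1.05, 1.25),
    ("1.25-2", 1.25, 2.0),
    ("above 2", 2.0, float("inf")),
]


def bin_label(ratio):
    """Return the label of the bin that a sales ratio falls into."""
    if ratio == 1:
        return "exactly 1"
    for label, low, high in BINS:
        if low <= ratio < high:
            return label
    raise ValueError(f"ratio outside all bins: {ratio}")


# Keep only firms present in both sources with non-missing, non-zero sales.
ratios = {}
for firm in sales_a.keys() & sales_b.keys():
    a, b = sales_a[firm], sales_b[firm]
    if a and b:  # skips both None and 0
        ratios[firm] = a / b

# Table with two columns per bin: count of observations and mean ratio.
by_bin = {}
for ratio in ratios.values():
    by_bin.setdefault(bin_label(ratio), []).append(ratio)
summary = {label: (len(values), mean(values)) for label, values in by_bin.items()}

for label, (count, avg) in summary.items():
    print(f"{label:22s} count = {count}, mean = {avg:.3f}")
```

<p>A large count in the “exactly 1” bin and in the narrow bin around 1 suggests the two sources measure the same thing, while heavy tails point to unit or definition differences worth investigating.</p>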
<p>Bins for xs, from closest to 1 to furthest from 1:</p>
<ul>
<li>=1</li>
<li>0.952-1.05 but not 1</li>
<li>0.8-0.952</li>
<li>1.05-1.25</li>
<li>0.5-0.8</li>
<li>1.25-2</li>
<li>&lt;0.5</li>
<li>&gt;2</li>
</ul>Gábor Békés and Gábor KézdiSuppose you have data, but with many missing observations. You find a new data source that overlaps but has the potential for new observations or new values. How best to compare them?Variants of random sampling2020-07-13T00:00:00+00:002020-07-13T00:00:00+00:00https://gabors-data-analysis.com/posts/2020/07/sampling-variants<p>There are many variations on random sampling that aim to further improve representation or reduce the costs of data collection.</p>
<h2 id="3-variants-of-random-sampling">3 Variants of random sampling</h2>
<p><strong>Stratified sampling</strong> improves on the representative nature of a sample. It starts by specifying large groups in the population, and it applies random sampling within each group.</p>
<p><strong>Cluster sampling</strong> starts by specifying small groups, takes a random sample of those groups, and collects data from all observations within each selected group. It can greatly reduce data collection cost, for example for in-person interviews that require a lot of traveling. In exchange, cluster samples increase the uncertainty in the data because the number of clusters is a lot smaller than the number of people, thus the sample size is smaller from the point of view of representation.</p>
<p><strong>Multi-stage sampling</strong> mitigates this problem by selecting a somewhat larger number of clusters in a first stage and taking a random sample of observations within each cluster instead of including all observations. Some data collection requirements present specific circumstances that may call for further variations on random sampling.</p>
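<p>The three variants can be sketched in a few lines of Python. The population below is made up (1,000 people in 50 villages across two regions), and the function names are ours; this is an illustration of the selection rules, not a survey design tool.</p>

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 people, 50 villages of 20, two regions.
population = [
    {"id": i, "village": i % 50, "region": "north" if i % 50 < 20 else "south"}
    for i in range(1000)
]


def stratified_sample(units, stratum_key, share):
    """Random sample of the same share within each large group (stratum)."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[stratum_key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, round(len(members) * share)))
    return sample


def cluster_sample(units, cluster_key, n_clusters):
    """Random sample of small groups; everyone in a chosen group is included."""
    clusters = sorted({unit[cluster_key] for unit in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [unit for unit in units if unit[cluster_key] in chosen]


def multi_stage_sample(units, cluster_key, n_clusters, n_per_cluster):
    """Random sample of clusters, then a random sample within each of them."""
    clusters = sorted({unit[cluster_key] for unit in units})
    sample = []
    for cluster in rng.sample(clusters, n_clusters):
        members = [unit for unit in units if unit[cluster_key] == cluster]
        sample.extend(rng.sample(members, n_per_cluster))
    return sample


strat = stratified_sample(population, "region", 0.1)
clus = cluster_sample(population, "village", 5)
multi = multi_stage_sample(population, "village", 10, 10)

for name, sample in (("stratified", strat), ("cluster", clus), ("multi-stage", multi)):
    villages = len({unit["village"] for unit in sample})
    print(f"{name:12s} sample size = {len(sample)}, villages covered = {villages}")
```

<p>All three samples have 100 people, but they cover very different numbers of villages. That is the trade-off in the text: fewer villages mean less traveling, and also less information from the point of view of representation.</p>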
<p>One example of stratified sampling comes from the <code class="language-plaintext highlighter-rouge">Comparing online and offline prices</code> case study. People collecting data received instructions to sample a specified number of products (between 10 and 50) in each store, in a stratified manner (by department). To what extent the resulting samples ended up representing the population is an open question here. However, if the purpose of the analysis is to compare online and offline prices, it is the distribution of this price difference that has to be similar in the sample and in the population of all products. As long as these distributions are similar, other differences are not that important.</p>Gábor Békés and Gábor KézdiThere are many variations on random sampling that aim to further improve representation or reduce the costs of data collection.ATE vs ATET: when should we care about them?2020-07-13T00:00:00+00:002020-07-13T00:00:00+00:00https://gabors-data-analysis.com/posts/2021/03/ate-vs-atet<p>In Chapters 19, 21, and 22 we talk about both the average treatment effect (ATE) and the average treatment effect on the treated (ATET). We discuss how to compute them in some cases and how to interpret them. In this short piece let us discuss two points scattered in the book: which one should we care about and what is the difference in terms of computation. </p>
<h2 id="which-one-shall-we-care-about">Which one shall we care about?</h2>
<p>Whether ATE or ATET is the more policy-relevant one depends on the situation. Two questions help decide: who are the non-treated subjects that are included in ATE but not in ATET, and will the implementation you really care about leave out the same kind of subjects? A common description of ATET is that it shows the “gains from the intervention for those subject it was intended to, and can be compared with its costs”. That may not be true. ATET differs from ATE when there is selection into treatment, and that selection may be different in the data than in the implementation you care about. For example, you may want to introduce a compulsory program, while the data compare subjects for whom the program was implemented with subjects for whom it was not implemented yet. You may also care about spillovers on non-treated subjects.</p>
<h2 id="matching">Matching</h2>
<p>This is explained in detail in Chapter 21. With exact matching, the effect estimate is an average of differences in outcomes between matched treated (x = 1) and untreated (x = 0) observations:
“When unweighted, this is an estimate of the average treatment effect on the treated (ATET), because each difference within the average is computed for each and
every observation with x = 1. If the weights include the number of x = 0 observations, too, it’s an estimate of the average treatment effect (ATE).”</p>
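<p>Here is a minimal sketch of that weighting on made-up data with a single confounder that takes two values. Within each confounder cell we compute the difference in average outcomes between treated and untreated observations. Weighting the cells by their number of treated observations gives the ATET estimate, and weighting by all observations gives the ATE estimate. Variable names and numbers are hypothetical.</p>

```python
from statistics import mean

# Hypothetical data: (confounder cell z, treatment x, outcome y).
data = [
    ("young", 1, 12.0), ("young", 1, 14.0), ("young", 1, 13.0), ("young", 0, 10.0),
    ("old", 1, 21.0), ("old", 0, 20.0), ("old", 0, 19.0), ("old", 0, 21.0),
]

cells = sorted({z for z, _, _ in data})
diff, n_treated, n_all = {}, {}, {}
for cell in cells:
    y1 = [y for z, x, y in data if z == cell and x == 1]
    y0 = [y for z, x, y in data if z == cell and x == 0]
    diff[cell] = mean(y1) - mean(y0)    # treated minus untreated within the cell
    n_treated[cell] = len(y1)           # weight for ATET
    n_all[cell] = len(y1) + len(y0)     # weight for ATE

# ATET: within-cell differences averaged over treated observations only.
atet = sum(diff[c] * n_treated[c] for c in cells) / sum(n_treated.values())

# ATE: within-cell differences averaged over all observations.
ate = sum(diff[c] * n_all[c] for c in cells) / sum(n_all.values())

print(f"within-cell differences: {diff}")
print(f"ATET estimate: {atet:.2f}")
print(f"ATE estimate:  {ate:.2f}")
```

<p>In this example the effect is larger among the young, and the young are more likely to be treated, so the ATET estimate (2.5) is above the ATE estimate (2.0). With no selection on the confounder, the two sets of weights would coincide.</p>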
<h2 id="ols">OLS</h2>
<p>You can also estimate ATET by OLS, in a way that is analogous to ATET versus ATE with matching: reweight observations by the distribution of confounders in the treated group. OLS estimates ATE for the population represented by the data, and weights can make that data represent the treated subjects. With many confounder variables that can be cumbersome in practice, just like exact matching would be. But you can approximate it by estimating the propensity score and using it as a weight. By the way, you can also have the propensity score as your only right-hand-side variable besides the causal variable, instead of all the confounders.</p>
<h2 id="diff-in-diffs">Diff-in-diffs</h2>
<p>As explained in Chapter 22,
“Diff-in-diffs gives a good estimate of the average treatment effect on the treated (ATET) if the average change among untreated subjects is a good counterfactual to the average change among treated subjects. That happens if the outcome of the treated subjects would have changed, on average, in the same way as it changed among the untreated subjects, had the intervention not taken place.
In addition, diff-in-diffs gives a good estimate of the overall average treatment effect (ATE) under two conditions. First, the previously described condition for the ATET is satisfied. Second, if the average change among treated subjects is a good counterfactual to the average change among untreated subjects. This second condition is met if the outcome of the untreated subjects would have changed, on average, the same way as the outcome would have changed for the treated subjects, had they, too, been treated.”</p>
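<p>The diff-in-diffs estimate itself is simple arithmetic on four averages. The sketch below uses made-up outcomes for three treated and three untreated subjects, observed before and after the intervention. Whether the resulting number is a good estimate of ATET or ATE depends on the conditions quoted above, which the code cannot check.</p>

```python
from statistics import mean

# Hypothetical outcomes before and after the intervention.
treated_before = [10.0, 12.0, 11.0]
treated_after = [15.0, 16.0, 17.0]
untreated_before = [8.0, 9.0, 10.0]
untreated_after = [10.0, 11.0, 12.0]

# Average change within each group.
change_treated = mean(treated_after) - mean(treated_before)
change_untreated = mean(untreated_after) - mean(untreated_before)

# Diff-in-diffs: the change among the treated minus the change among the
# untreated, which serves as the counterfactual change.
diff_in_diffs = change_treated - change_untreated

print(f"average change, treated:   {change_treated:.1f}")
print(f"average change, untreated: {change_untreated:.1f}")
print(f"diff-in-diffs estimate:    {diff_in_diffs:.1f}")
```

<p>Note that the two groups start from different levels (11 versus 9 on average). Diff-in-diffs does not require similar levels, only that the untreated change is a good stand-in for what would have happened to the treated.</p>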
<h2 id="read-more">Read more</h2>
<p>We have not dwelled on details a great deal. You may turn to <a href="link">Imbens</a> for more ideas.</p>Gábor Békés and Gábor KézdiIn Chapters 19, 21, and 22 we talk about both the average treatment effect (ATE) and the average treatment effect on the treated (ATET). We discuss how to compute it in some cases and how to interpret them. In this short piece let us discuss two points scattered in the book: which one should we care about and what is the difference in terms of computation.