Errors we made and found

Errata to Békés-Kézdi: Data Analysis for Business, Economics, and Policy, Cambridge University Press, 2021

There are a few errors we made, unfortunately. Some are typos or swapped figure labels; some are imprecise language. It may also be that we found an important error in the code and corrected it, so the code no longer exactly reproduces the tables and graphs in the book.

Fortunately, we found some. As we, and our kind readers, carry on finding more errors, we are adding them here. You should review them before reading / teaching.

If you find an error, please report it to us HERE

List of errors (Latest update: July 2023)

Part I

ID Date added Error Type Chapter Page Problematic Corrected
01-01 2021-10-26 Imprecise Ch01 p.19 If it shows similar distributions then the sample is representative for the variable, or variables, used in the comparison. If it shows similar distributions then the sample is likely to be representative for the variable, or variables, used in the comparison.
01-02 2021-11-06 Imprecise Ch01 p.27 pq4 Give an example of data with selection bias and are without Give an example of data with selection bias and one without it
01-03 2022-06-13 Typo Ch01 p.10 Management Quality and Firm Performance: Data Collection Management Quality and Firm Size: Data Collection
01-04 2022-06-28 Typo Ch02 p.33 the difference between inventories of a chocolate factory at the end of this month and the end of last month is the difference between chocolate production and chocolate sales during last month … the difference between chocolate production and chocolate sales during this month
01-05 2022-06-28 Easier read Ch02 p.34 Observations in xt data are one unit observed in one time period. One observation in an xt data is one unit observed in one time period.
02-01 2022-06-28 Imprecise Ch02 p.55/DE/3 For FIN, there is one lecture for DA1 and DA2, and another lecture for all other programs For both DA1 and DA2 courses, there is one lecture for FIN, and another one for all other programs.
03-01 2021-12-15 Typo Ch03 p75 The range is around 50 dollars in both cities. The range starts around 50 dollars in both cities
03-02 2022-09-22 Imprecise Ch03 p91 The binomial distribution has one mode in the middle, and it is symmetric so its median, mean, and mode are the same The binomial distribution is not symmetric in general. It is symmetrical when \(p=0.5\) (with the mean = the median), but will be skewed to the left or right otherwise.
03-02 2022-09-22 Imprecise Ch03 p96 They range between zero and positive infinity (never reaching either) They range between zero and positive infinity (never reaching exactly 0)
04-01 2022-04-20 Typo Ch04 p109 a symmetrical U-shaped conditional expectation has an average of zero a symmetrical U-shaped conditional expectation function has a zero average
04-02 2022-04-20 Easier read Ch04 p109 The more balanced the positive deviation in \(x_i\) and positive deviation in \(y_i\) instances are with the positive deviation in \(x_i\) and negative deviation in \(y_i\) instances, the closer the covariance is to zero. We have the covariance closer to zero, when we have a more balanced ratio of the two types of instances – positive deviation in \(x_i\) and positive deviation in \(y_i\) versus positive deviation in \(x_i\) and negative deviation in \(y_i\)
04-03 2022-04-20 Typo Ch04 p109 Thus, they give a quick and not completely meaningless picture about mean-dependence among binary and ordered qualitative variables. However, they are more appropriate measures for qualitative variables. Thus, they give a quick and not completely meaningless picture about mean-dependence among binary and ordered qualitative variables. However, they are more appropriate measures for quantitative variables.
04-04 2022-04-22 Sentence wrong Ch04 p108 In contrast, larger firms differ more from each other in terms of their management score. (cut)
04-05 2022-04-22 Sentence wrong Ch04 p112-13 Finally, we have seen that management quality is not only better, on average, among larger firms, but it is also somewhat more spread among larger firms. (cut)
04-06 2022-09-22 Imprecise Ch04 p97 Many questions that data analysis can answer are based on comparing values of one variable, y, against values of another variable, x, and often other variables. Many questions that data analysis can answer are based on comparing values of one variable, y, by values of another variable, x, and often by other variables.
05-01 2022-09-30 Typo Ch05 p140 i.i.d. variables: identical and independently distributed variables i.i.d. variables: independent and identically distributed variables
05-01 2023-07-04 Error Ch05 p127 For example in the bin (0.8,1) we have 512 cases for the N=1000 exercise and 1250 cases for the N=500 exercise with values of 0.8 or 1. For example in the bin (0.8,1), containing values greater or equal to 0.8 but less than 1, we have 853 cases for the N=1000 (8.53%) exercise and 1451 cases for the N=500 exercise (14.5%)
06-01 2021-10-20 Typos Ch06 B1 p159 “That p-value is shown to be 0.0000, which means that it’s less than 0.000 05. According to step 2 above, we can further divide this by two, and that would lead to an even smaller p-value.” “That p-value is shown to be 0.0007, which means that it’s less than 0.05. According to step 2 above, we can further divide this by two, and that would lead to an even smaller p-value (0.00035).”
06-02 2022-09-22 Wrong number Ch06 p146 The mean difference is −0.05 US dollars: online prices are, on average, 5 cents lower in this data. The mean difference is 0.05 US dollars: online prices are, on average, 5 cents higher in this data.
06-03 2022-09-22 Wrong number Ch06 p146 s = −0.05 s = 0.05
06-04 2022-09-22 Wrong number Ch06 p147, Fig6.1 Mean = −0.05 Mean = 0.05
06-05 2023-07-04 Typo Ch05 156 true average price difference is more than −30 cents or +20 cents true average price difference is more than −30 cents or +19 cents.
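
Erratum 03-02 on the binomial distribution can be checked numerically. The snippet below is a minimal sketch, not taken from the book's code: the number of trials and the values of p are hypothetical, and the helper names are ours. It tests whether P(X = k) equals P(X = n - k) for every k and reports the closed-form skewness, (1 - 2p) / sqrt(np(1 - p)).

```python
from math import comb


def binomial_pmf(n, p):
    """Return the probabilities P(X = k) for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def is_symmetric(pmf, tol=1e-12):
    """True if P(X = k) equals P(X = n - k) for every k."""
    return all(abs(a - b) <= tol for a, b in zip(pmf, reversed(pmf)))


def skewness(n, p):
    """Closed-form skewness of the binomial distribution."""
    return (1 - 2 * p) / (n * p * (1 - p)) ** 0.5


n = 10  # hypothetical number of trials
for p in (0.2, 0.5, 0.8):  # hypothetical success probabilities
    pmf = binomial_pmf(n, p)
    print(f"p={p}: symmetric={is_symmetric(pmf)}, skewness={skewness(n, p):.3f}")
```

Only p = 0.5 gives a symmetric distribution; skewness is positive (long right tail) for p below 0.5 and negative (long left tail) for p above 0.5.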

Part II

ID Date added Error Type Chapter Page Problematic Corrected
07-01 2023-01-02 Imprecise Ch07 p.179 The average slope has an important interpretation: it is the difference in average y that corresponds to different values of x, averaged across the entire range of x in the data. The average slope has an important interpretation. Take the yE = f(x) curve and consider different values of x and the corresponding difference in average y. The average slope is the average of these differences, calculated over the entire range of x in the data.
08-01 2023-01-02 Typo Ch08 p209, Fig 8.2a Figure 8.2a y axis: ln(price, US dollars) Figure 8.2a y axis: Price (US dollars)
08-02 2023-07-02 Typo Ch08 p220 GDP per capita over 1 600 000 measured in thousand dollars, which is USD 1.6 trillion GDP per capita over 1 600 000 measured in thousand dollars, which is USD 1.6 billion.
09-01 2021-11-05 Miss reference Ch11 p.239 (about SE…) simple formula in that it is smaller the smaller Std[e], the larger Std[x], and the larger \(\sqrt n\). … A more precise definition would have a degree of freedom correction with \(\sqrt{n-2}\), see Under the Hood section 9.U4.
10-01 2021-03-06 Typo Ch10 B1 p.293 Table 10.6 N=217 N=207
10-02 2021-03-08 Missing Ch10 B1 p.285 Graph 10.2 Note, missing info Male: blue, female: green
11-01 2023-04-10 Typo + explain Ch11 315 Note that a model may be unbiased on average but not well calibrated. For instance, it may underestimate the probability when it’s high and underestimate it when it’s low. Note that a model may be unbiased on average but not well calibrated. For instance, it may underestimate the probability when it’s high (e.g. y^p=60% vs y=80%) and at the same time, overestimate it when it’s low (e.g. y^p=30% vs y=10%).
11-02 2023-06-06 Typo Ch11 p.312 Recall from Chapter 7, Section 4.U1, that goodness of fit Recall from Chapter 7, Section 7.9, that goodness of fit
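
The corrected sentence in erratum 11-01 can be illustrated with a small calculation. This is a minimal sketch: the predicted and actual shares are the illustrative numbers from the erratum, while the group sizes (100 observations each) are our assumption. With equal group sizes the two errors cancel, so the model is unbiased on average even though it is off within each group.

```python
# (predicted probability, actual share with y = 1, number of observations)
# Shares come from the erratum's example; group sizes are hypothetical.
groups = [
    (0.60, 0.80, 100),  # high-probability group: model underestimates
    (0.30, 0.10, 100),  # low-probability group: model overestimates
]

total = sum(n for _, _, n in groups)
mean_predicted = sum(p * n for p, _, n in groups) / total
mean_actual = sum(y * n for _, y, n in groups) / total

# Bias compares the averages over all observations.
bias = mean_predicted - mean_actual

# Calibration compares prediction and outcome within each group.
calibration_gaps = [round(p - y, 2) for p, y, _ in groups]

print(f"mean predicted = {mean_predicted:.2f}, mean actual = {mean_actual:.2f}")
print(f"bias = {bias:.2f}, calibration gaps by group = {calibration_gaps}")
```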

Part III

ID Date added Error Type Chapter Page Problematic Corrected
14-01 2021-01-06 Imprecise sentence Ch14 B1 p.401 “The number of apartments or rooms is left as it is, and treated as continuous..” “The number of guests to accommodate or rooms is left as it is, and treated as continuous.”
14-02 2021-02-07 Typo Ch14 p.415 “two variables, \(x_i x_j\) and \(x_i^2 x_j\) and \(x_i^2 x_j\)” “two variables, \(x_i x_j\) and \(x_i^2 x_j\) and \(x_i x_j^2\)”
14-03 2021-02-13 Imprecise sentence Ch14 B1-B4 The currency is USD for price Actually, local currency (GBP) is used. Recently clarified
15-01 2021-01-19 Typo in number Ch15 p.423-24 In text, and Figure 15.3, cp=0.001 is wrong It’s cp=0.01
15-02 2021-01-19 Typo in text Ch15 p.427 “improved the R-squared in the test sample by less than” improved the R-squared in the train sample by less than
15-03 2021-07-13 Code vs text Ch15 p.431 “Therefore, it should be performed on the holdout set.” “However, it may be performed on the training set.”
15-04 2021-07-13 Wrong comment Ch15 p.433 “In Figure 15.7, we can look at variable importance for a regression tree on the holdout set. Note that the role of the holdout set is played by the single test set of 144 observations in this oversimplified case study.” “In Figure 15.7, we can look at variable importance for a regression tree. Note that we used a single training set in this oversimplified case study.”
15-05 2021-07-13 Code vs text Ch15 p.434 Figure 15.7 (holdout set, N=144). (training set, N=333).
15-06 2021-10-06 Code vs text comment Ch15 p.434 Figure 15.7 The variable importance plot has small values for features that are not part of the tree. This is not an error, just part of how some variable importance algorithms work (e.g. rpart in R) The reduction in the loss function attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. Default in R. Not in Python.
16-01 2021-01-19 Wrong reference Ch16 p.443 “We have illustrated the basics of growing a regression tree using the airbnb dataset in a single London borough.” “We have illustrated the basics of growing a regression tree using the used-cars dataset.”
16-02 2021-07-13 Code vs text Ch16 p.444 “using the holdout sample that we set aside (Chapter 14, Section 14.7).” “using the training as well as the holdout sample that we set aside (Chapter 14, Section 14.7).”
16-03 2021-01-19 Imprecise Ch16 p. 445 “The partial dependence plot shows the values of the x variables within each copy of the data against the average predicted y from that data.” “The partial dependence plot shows the values of the x variables against the average predicted y on the holdout set.”
16-04 2021-07-13 Code vs text Ch16 p.447 Figure 16.1 footnote: “Variable importance based on predictions for the holdout set.” … “(holdout set, N=14 946)” “Variable importance based on predictions for the training set.” … “(work set, N=34 880)”
16-05 2021-07-13 Code vs text Ch16 p.448 Figure 16.2 footnote: “Variable importance based on predictions for the holdout set.” … “(holdout set, N=14 946)” “Variable importance based on predictions for the training set.” … “(work set, N=34 880)”
16-06 2021-01-20 Typo in graph numbers Ch16 p.448 Figure 16.2a and 16.2b wrong 16.2a and 16.2b titles should be swapped: 16.2a is “Factor variables grouped”; 16.2b is “Top 10 important variables”.
16-07 2021-02-09 Imprecise language Ch16 p.446-8, Box 16.3. PDP: it shows “average y,”, about the “\(y-x\) relationship” conditional on other x variables. The PDP shows average predicted y ( \(\hat{y}\)), about the “\(\hat{y} - x\) relationship” conditional on other variables.
16-08 2021-02-13 Imprecise sentence Ch16 A1-A3 The currency is USD for price Actually, local currency (GBP) is used. Recently clarified
17-01 2021-01-21 Typo numbers Ch17 p.479 “Yields 139 euros higher profit … increase of 139 000 euros in profits” “Yields 135 euros higher profit … increase of 135 000 euros in profits “
18-01 2021-07-12 Code vs table Ch18 p.509 “RMSE result for the VAR is RMSE=4.4” “RMSE result for the VAR is RMSE=4.5”
18-02 2021-07-12 Code vs table Ch18 p.510 M7 (var) RMSE line presents results without seasonality ( reads: 13.30, 5.85, 3.52, 4.28, 7.8) M7 (var) RMSE line should read: 5.24, 2.51, 5.18, 4.75, 4.5
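
Errata 16-03 and 16-07 stress that a partial dependence plot shows average predicted y, not average y. The snippet below is a minimal sketch of that calculation, not the book's code: `toy_predict`, the variable names, and the three rows standing in for a holdout set are all hypothetical. For each value of x, every observation gets that value, the model predicts, and the predictions are averaged.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, set `feature` to that value in every row,
    predict, and average the predictions over all rows."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_predict(row):
    """Hypothetical fitted model, for illustration only."""
    return 50 + 20 * row["n_accommodates"] + 10 * row["n_beds"]


# Hypothetical rows standing in for the holdout set.
rows = [
    {"n_accommodates": 2, "n_beds": 1},
    {"n_accommodates": 4, "n_beds": 2},
    {"n_accommodates": 6, "n_beds": 3},
]

pdp = partial_dependence(toy_predict, rows, "n_accommodates", [1, 2, 3])
for value, avg_pred in pdp:
    print(value, avg_pred)
```

The other variables keep their observed values in each row, which is what makes the curve an average over the data rather than a prediction for a single observation.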

Part IV

ID Date added Error Type Chapter Page Problematic Corrected
19-01 2021-02-16 Typo reference Ch19 p.562 “… with the help of a t-test (Chapter 6, Section 5.U1).”, “…and the false negative (see Chapter 6, Section 5.U1)” “… with the help of a t-test (Chapter 6, Section 3).”, “…and the false negative (see Chapter 6, Section 4)”
21-01 2021-03-01 Typo number Ch21 p.600 In Table 21.1, the number of observations in column 1 N=8440 N is 8439 not 8440
21-02 2021-03-01 Typo number Ch21 p.600 Formulae 21.17 and 21.21 are not correct, in the second term in the denominator. In the second term in the denominator, instead of x=0 there should be x=1
21-03 2021-05-11 Typo number Ch21 p.607 In Table 21.2, the number of matched observations (5751 and 5528) slightly off col 1: 5716, col 2: 5481
21-04 2021-05-11 Typo number Ch21 p.607 In Table 21.2, the number of observations in the second column (8827) is slightly off N is 8439 not 8227
22-01 2022-03-15 Text not match code Ch22 p.628 CS: “This definition of treated and untreated markets left some markets neither treated nor untreated: those with only American or only US Airways present in 2011. For the main analysis we dropped these from the data. It is possible that the merger affected these markets as well. In a data exercise you’ll be invited to examine if including these markets among the treated ones leads to different conclusions.” CS: “This definition of treated and untreated markets left some markets neither treated nor untreated: those with only American or only US Airways present in 2011. For the main analysis we kept these in the data as untreated. It is possible that the merger affected these markets as well. In a data exercise you’ll be invited to examine if excluding these markets among the treated ones leads to different conclusions.”
22-02 2022-03-15 Text not match code Ch22 p.647 E2: “Use the same airline-tickets-usa dataset that we used in the case study, with the same two years, 2011 and 2016. Re-do the analysis with an alternative treatment definition: markets that had either AA or US (or both) present at baseline. “ CS: “Use the same airline-tickets-usa dataset that we used in the case study, with the same two years, 2011 and 2016. Re-do the analysis with an alternative treatment definition: exclude markets that had either AA or US (or both) present at baseline. “
23-01 2022-03-15 Typo number Ch23 p.680 E4: “present results analogous to the ones in Tables 23.4 and 23.5, and discuss what you find” E4: “present results analogous to the ones in Tables 23.3, 23.4 and 23.5, and discuss what you find”
24-01 2020-12-09 Text not match code Ch24 B2 p.696 “When there was more than one candidate game within the same season for the same team, we selected the first one in the season.” “When there was more than one candidate game within the same season for the same team, we selected one in the season randomly.”
24-02 2021-06-07 Imprecise sentence Ch24 B2 page 698 “Here the intercept, \(\beta_0\), shows the average change in points in the reference time period, from 7–12 games before to 1–6 games before, for pseudo-interventions. \(\beta_1\) shows the average change in points from 1–6 games before to 1–6 games after, in addition to \(\beta_0\). \(\beta_2\) shows the average change in points from 1–6 games after to 7–12 games after, again, in addition to \(\beta_0\). Thus, the change from 1–6 games before the pseudo-intervention to 1–6 games after it is \(\beta_1 + \beta_0\).” “Here, the intercept, \(\beta_0\) shows the average change in points in the reference time period: from event time window \([-12,-7]\) to event time window \([-6,-1]\), for the control group. \(\beta_1\) shows the average change in points from event time window \([-6,-1]\) to event time window \([1,6]\), compared to the change in the reference time period (captured by \(\beta_0\)), for the control group. \(\beta_2\) shows the average change in points from event time \([1,6]\) to \([7,12]\), again compared to the change in the reference time period, for the control group.”
24-03 2021-06-07 Imprecise sentence Ch24 B2 page 698 ”\(\beta_3\) shows the difference between the treatment and control group in terms of average point change from 7–12 games before to 1–6 before. If we selected the control group well, this should be close to zero.” ”\(\beta_3\) shows the treatment-control difference in the change in the reference time period (from \([-12,-7]\) to \([-6,-1]\)). If we selected the control group well, \(\beta_3\) should be close to zero, because we want the control group to have the same pre-treatment changes in the outcome variable.”
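
The corrected interpretations in errata 24-02 and 24-03 can be read off group averages. The snippet below is a minimal sketch and makes two assumptions: the regression is fully saturated in group and time-window dummies, so each coefficient equals a difference of group means, and all the numbers are hypothetical, not estimates from the case study. It covers only \(\beta_0\) to \(\beta_3\), the coefficients discussed in the two errata.

```python
# Hypothetical average changes in points, by group and by the transition
# between event time windows. None of these numbers come from the book.
mean_change = {
    ("control", "ref"): -0.30,      # [-12,-7] -> [-6,-1], reference period
    ("control", "to_1_6"): 0.35,    # [-6,-1]  -> [1,6]
    ("control", "to_7_12"): 0.05,   # [1,6]    -> [7,12]
    ("treated", "ref"): -0.28,      # [-12,-7] -> [-6,-1], reference period
}

# beta0: control group, change in the reference period.
beta0 = mean_change[("control", "ref")]

# beta1, beta2: control group, later changes relative to the reference change.
beta1 = mean_change[("control", "to_1_6")] - beta0
beta2 = mean_change[("control", "to_7_12")] - beta0

# beta3: treatment-control difference in the reference-period change.
beta3 = mean_change[("treated", "ref")] - beta0

print(f"beta0={beta0:.2f}, beta1={beta1:.2f}, beta2={beta2:.2f}, beta3={beta3:.2f}")
```

In this made-up example \(\beta_3\) is close to zero, which is the pattern one would hope for if the control group has the same pre-treatment changes as the treated group.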