Plans for second edition

The second edition, planned for late 2025, will focus on correcting errors, improving some explanations, and making minor edits throughout. There may be a single new chapter.

Error correction

The top priority is correcting typos and errors based on the errata page.

typos, errors
improve unclear sentences
add a few lines of explanation when needed

Math

My plan is to add more math. This is to better serve more demanding courses and to help students get what they need to pursue more advanced courses.

It will be a mix of math and explanations, often sketching the proof.
About 80% of it will go into the Under the Hood sections.

Data Analysis with AI BOX

I do not plan to have separate sections on AI. This is because things have changed so much and are changing rapidly. However, there will be “Data Analysis with AI” boxes scattered throughout the text (we need a new color…). They will talk about how AI can help in the given task. This reflects my view that data analysis and AI are not separate: AI is a new tool to help along the way.

There will be AI exercises at the end such as creating a dashboard to simulate a bit of theory.

**Chapter 01**
AI is often used to encode information from text. One such example is extracting sentiments, emotions or values. Large language models not only capture keywords but also understand the semantics of a sentence and even its context.

Describe how AI could be used to encode WMS scores from interview transcripts instead of relying on trained graduate students' judgement.
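
A minimal sketch of one way this exercise could be approached. Everything here is an assumption made for illustration: the 1 to 5 integer scale, the prompt wording, and the `ask_model` callable, which stands in for whatever LLM client is used (no real API is called). The idea is only to show the shape of the workflow: build a prompt from the transcript, ask a model, and validate the reply before treating it as a score.

```python
import re
from typing import Callable, Optional


def build_prompt(transcript: str) -> str:
    """Assemble a scoring prompt (wording is illustrative, not a tested rubric)."""
    return (
        "You are scoring a management practice interview.\n"
        "Rate the practice described on an integer scale from 1 (worst) to 5 (best).\n"
        "Answer with the number only.\n\n"
        f"Transcript:\n{transcript}"
    )


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the reply, or None if absent."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def score_transcript(transcript: str, ask_model: Callable[[str], str]) -> Optional[int]:
    """Score one transcript; ask_model is a hypothetical prompt -> reply function."""
    return parse_score(ask_model(build_prompt(transcript)))


def fake_model(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return "Score: 4"


print(score_transcript("We review targets every week with the whole team.", fake_model))
```

In practice one would compare such scores with the human-coded ones on a subsample before relying on them; replies that cannot be parsed come back as `None` and can be sent for manual review.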

Beyond

Each section will get a Beyond bit replacing Further readings. This will be somewhat longer and link to an ever-growing online version called Beyond: Directions to Frontier. Basically, an extra 1-2 paragraphs pointing readers towards what’s new.

Applications

Short new bit at the end of many sections. Basically a list of applications for the methods covered. Kind of a “What for” learning outcome. Sometimes comparing with previously covered methods.

(chapter 03) we showed a histogram, here is an example from The Economist.
(chapter 12) we showed you TS analysis, here is an application to XX sales data

Dashboards

The ecosystem will grow with a series of Dashboards / apps to practice the material, or have interactive sessions.

Broad issues I’m thinking about

Chapter 10 is too large, and is set to be bigger. Some instructors say 07-09 is too slow (but maybe not for students!). Some magic rearrangement: merge 07 and 08, and cut ch 10 into two?
Case studies: Add 2-3 case studies.
- World Values Survey
- …

Improvements and additions

We plan several smaller improvements. Mostly adding some examples, better explanations. Also adding concepts based on feedback. Typically a few extra paragraphs, maybe a short new section.

A few broader changes are denoted with bold:

Frisch-Waugh-Lovell theorem
Prediction model interpretability
Staggered Diff-in-diffs (+ more discussion on design)

Part 1

Chapter	Topic	Idea
01	Cut DS	Shorten some bits on API and move to appendix.
02	Cut DS	Shorten some bits on wrangling and move to appendix.
03	Redo, expand on distributions	3.9, 3.U1 – Redo the theoretical distribution section. Bring pdf, cdf to main bit. Show pdf and cdf for normal, log-normal. Give more reasons why they are useful when comparing cities, countries. Be more explicit re definitions of Pareto, scale-free, power law, Zipf’s law. Redo Pareto x axis
04	coding error and correlation	Show a dataset of
05	More FP/FN, costs	new short Case study. Maybe add a case study on estimating arrival time with simulation
05	more on what testing means	case study or exercise based on checking birthday effects in deaths, based on pudding piece
06	t-test for two samples	One para and the formula for independent sample means
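
For the two-sample t-test row above, one candidate formula is the unequal-variance (Welch) version of the test statistic. The notes do not say which version is planned, so both that choice and the notation below are assumptions:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

Here $\bar{x}_1, \bar{x}_2$ are the two sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes of the two independent samples. The denominator is the standard error of the difference in means.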

Part 2

Chapter	Topic	Idea
08	Practice of standard errors	Discuss special cases. One source is Gelman’s JE bit, but we have thought about it countless times.
08	attenuation	Add example for attenuation bias from Feodora Teti customs data paper, real policy implications
10,21	dataviz	Add coeffplots
10,22	p-values	Show tables with p-values and stars, add a para discussion ref back to p-hacking + both have pros and cons
10	Regression vs test	U.x Discuss how regression may be the same as testing ideas
10	Hard q on confounders	Suppose I have two random variables, $y$ and $x$. If I’m allowed to construct a third random variable $z$, I can guarantee that a regression $y = \beta_1 x + \beta_2 z$ will yield any value for $\beta_1$ I want. Source
10	Exercise	Read and discuss obesity gap by Economist
10	Blinder-Oaxaca decomposition	Blinder-Oaxaca-Kitagawa – decomposing gender gap with education and age
10	Frisch-Waugh-Lovell theorem	In 10.4, add a short section on FWL (no proof), with a case study application. Showcase what partialling out means. This replaces U.10.1! The key application will be a graph, i.e., show a scatterplot despite controls. Maybe use earnings case study. Production function, $z$ is L. Or even add a new case study on Mankiw-Romer-Weil QJE growth regressions. Deepnote
11	Poisson	Add other regression models, especially Poisson. This needs a new case study, maybe number of goals and shots on goal? Would also add another discussion of the role of zeroes. Introduce odds ratio.
12	What is trend + seasonality really	Seasonality as human behavior. Example: Interest over time on Google Trends for Diet. Also note that trend was always thought of as coming from a confounder such as population growth affecting the demand curve- first volume of Eca.
12	Long run trends	Frisch-Waugh: trend in $y$ as a sum of trends in RHS vars. With a long run trend in either, we can’t do much: either partial it out or leave it in.
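
To make the "partial out" idea in the Frisch-Waugh-Lovell row concrete, here is a small sketch on simulated data. The data-generating process, coefficient values and variable names are all invented for illustration, and the code uses only the Python standard library. It checks that the coefficient on $x$ from a regression of $y$ on $x$ and $z$ equals the slope from regressing the residuals of $y$ on $z$ against the residuals of $x$ on $z$.

```python
import random


def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def slope(y, x):
    """OLS slope of y on x, with an intercept."""
    yd, xd = demean(y), demean(x)
    return dot(xd, yd) / dot(xd, xd)


def residuals(y, x):
    """Residuals from an OLS regression of y on x, with an intercept."""
    yd, xd = demean(y), demean(x)
    b = dot(xd, yd) / dot(xd, xd)
    return [p - b * q for p, q in zip(yd, xd)]


def coef_on_x(y, x, z):
    """Coefficient on x in an OLS regression of y on x and z, with an intercept.

    Solves the two-regressor normal equations on demeaned data.
    """
    yd, xd, zd = demean(y), demean(x), demean(z)
    sxx, szz, sxz = dot(xd, xd), dot(zd, zd), dot(xd, zd)
    sxy, szy = dot(xd, yd), dot(zd, yd)
    return (sxy * szz - sxz * szy) / (sxx * szz - sxz ** 2)


# Simulated data (assumed for illustration): z is correlated with x and affects y.
random.seed(0)
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + random.gauss(0, 1) for zi in z]
y = [1.0 + 2.0 * xi - 1.5 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

beta_full = coef_on_x(y, x, z)            # multiple regression
x_res = residuals(x, z)                   # x with z partialled out
y_res = residuals(y, z)                   # y with z partialled out
beta_fwl = slope(y_res, x_res)            # regression on residuals
beta_naive = slope(y, x)                  # simple regression, no control

print(f"multiple regression: {beta_full:.6f}")
print(f"residual on residual: {beta_fwl:.6f}")
print(f"no control:           {beta_naive:.6f}")
```

The first two numbers agree up to floating-point error, while the simple regression slope differs. A scatterplot of `y_res` against `x_res` is the kind of graph the row refers to: it shows the conditional association in two dimensions despite the control.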

Part 3

Chapter	Topic	Idea
13	loss	Price prediction model trade-offs, loss function Kayak
14	var imp for OLS	For linear models in prediction, add a few para, new section on variable importance
14	ln OLS correction	More on what smearing does, when it’s better to use other formula, bias, MAE vs RMSE
14	Quant reg	If MAE is target, qreg is a way. MAE vs RMSE discussion
14	LASSO	The role of standardization, benefits and costs
16	CART	On a subset to have a small and nice tree on airbnb. Maybe in 15.
14	Correlated predictors	In any predictive model (OLS, RF), when we have many predictors that are correlated, we have problems: varimp and interpretation. Ideas: PCA, groupings, drop
16	interpretability	local vs global. marginal effects / SHAP for ML. For the machine learning bit, consider SHAP and LIME, other methods in addition to VIP
16	ensemble for OLS	For linear models in prediction, we can also have an ensemble model, ln+log (as assignment for cars)
16	cloud comp	Add run time in Google Colab / Amazon cloud for Table 16.4
16	advice	Add pointers on what can go wrong: leakage (+ how ols r2, varimp helps), variable content/availability change in live data, why not filter on target
17	F1	Add F1 to accuracy. Compare the role of Acc, F1, and AUC when (a) asymmetric loss fn, (b) class imbalance
18	ML for TS	Discuss the role of ML methods like boosting for time series data, and the fact that they do not take temporal aspects into account. Mention that other, time-series-specific models exist.
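
The F1 row above contrasts accuracy and F1 under class imbalance. A toy sketch of why the two can disagree follows; the confusion-matrix counts are made up for illustration, and returning 0 when F1 is undefined is a convention assumed here.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all observations classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN).

    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced sample: 50 positives, 950 negatives.
always_negative = {"tp": 0, "fp": 0, "fn": 50, "tn": 950}   # never flags anything
detector = {"tp": 40, "fp": 60, "fn": 10, "tn": 890}        # finds most positives

for name, c in [("always negative", always_negative), ("detector", detector)]:
    acc = accuracy(c["tp"], c["fp"], c["fn"], c["tn"])
    f1 = f1_score(c["tp"], c["fp"], c["fn"])
    print(f"{name}: accuracy = {acc:.3f}, F1 = {f1:.3f}")
```

With these counts, accuracy favours the rule that never predicts a positive (0.95 versus 0.93), while F1 favours the detector (about 0.53 versus 0). Which ranking matters depends on the loss function, which is the comparison the row proposes to discuss.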

Part 4

Chapter	Topic	Idea
19	Intro to causality	Friedman’s thermostat
19	DAGs	2 para more to link lingo closer to Pearl
19	SUTVA	2 para – Explicit about SUTVA
20	more on A/B test	Add a bit more on experiments in large companies like Uber Pool. Sources: Microsoft 2009, Kohavi HBR
21	More on RDD	A more detailed example on RDD, maybe even a short case study
21	Good vs bad control	Two example stories with discussion on controls, confounders, mechanism and collider
24	Staggered DiD	Event study, maybe add one of the new DiD methods using the same football case study. One solution is Goodman-Bacon, with later-treated units as controls.
24	Review case study	Consider Brexit for synthetic control paper, great deal of options.

Data Science extra

I’m planning to increase the coverage of key “data science” tools, focusing on pre-analysis stuff. Some of these are already in the book: APIs, wrangling, transforming text to bag of words. But I plan to expand on them and spend more time on selected elements. David Card suggested to focus on join, so we will…

What I plan is online material focusing on a dozen issues, often expanding on what is mentioned in the book. Currently I plan: API, combining datasets, entity resolution, basics of SQL, date and time, data storage (parquet, etc). Ideas? Suggestions? Ping me.

Data science section (online)

Chapter	Topic	Idea
02	join	Add more on joining tables, based on case studies
02	variable naming	Add a few para on naming variables, some ideas and when it’s really important. Extend 2.U1 or add 2.U2
04	Dashboards	What is a good dashboard, creating a simple one in shiny/quarto to show conditional means with hotel data
13	r vs python results	Add a few para/section discussing that results from algorithms without a closed-form solution will vary across platforms

Case studies, data sources

US Time share data – used obesity gap by Economist

Feedback

We are open to suggestions! Please make a suggestion for a minor change or a short addition you think would be helpful HERE. Please also report errors.

Reviewer suggestions I got

Add a section on the Oaxaca-Blinder decomposition to the shapes chapter – AGREED
Add data on Asian countries – AGREED
Add formulae for advanced students – AGREED, BUT NEED TO DECIDE WHAT EXACTLY
Add output from Python or R – all output is from R. Can add an online bit showing output and discussing differences
Add more on Big Data – AGREED, new case study on GH data
Add Pearl’s approach for graphs – OK, add some stuff to link up, no do-calculus
Add Text2Vec – NOT LIKELY
Broaden shape of data chapter to include data visualisation – ??
Combine chapters 7 to 9 – Well, some restructuring must happen, not 100% sure what
Show different ways that LLMs can assist in analysing structured and unstructured data, e.g., categorising text – AGREED
Discuss pitfalls with LLMs too – AGREED
Discuss staggered treatment for Diff-in-Diff – AGREED
Expand discussion of panel data analyses – AGREED, via DiD
Add a chapter explaining usefulness of different methods; use Learning Outcomes to motivate students – Not a new chapter, but something maybe
Add figure to 3.10 showing steps for EDA – AGREED
In Chapter 6, discuss hypothesis formulation – Okay, will think what exactly it means
Add more on interpreting coefficients as elasticities – AGREED
Add more about structural breaks in Chapter 8 – NOT LIKELY

Gábor Békés and Gábor Kézdi