Data Analysis with AI

A comprehensive course

Author

Gábor Békés, Central European University (Austria, EU)

Published

May 1, 2026

What’s this

A course for students who already know data analysis / econometrics and want to seriously rework their practice around AI. The aim is not “tips and tricks”: it is to spend a semester building real things with AI as a teammate, and to think honestly about where it helps, where it lies, and where it lets you do work you genuinely could not do alone.

As AI becomes more and more powerful, it is also important to provide a platform to discuss human agency in data analysis. A core role of the instructor is to lead that discussion across the term — what we delegate, what we keep, and how we stay accountable for the result.

This is the Spring 2026 edition release.

Two parts

The course has two clearly distinct halves:

  • Part 1 — Weeks 1–10. Use AI on tasks students mostly can do unaided: documentation, reporting, exploratory analysis, simple text work, control-variable selection, IV reasoning. The goal is fluency and good habits — chat-based work, then agentic CLIs (Claude Code), then AI as a research companion.
  • Part 2 — Capstone (3 sessions). A team project that pushes students into work they have never done before: production-style web scraping, text-as-data classification using LLM APIs, and modern causal econometrics (difference-in-differences with staggered treatment, event-study designs). The whole point is that AI makes this newly accessible — and the course makes students actually do it.

AI and me

At the end of all classes, instructors and students should always consider these three questions.

  1. How did AI support me: did it help me do what I planned?
  2. How did AI fail me: half-truths, buggy code, imprecise arguments?
  3. How did AI extend me: did it help me do things I could not do alone, or give me new ideas?

Course description

Content — Part 1: weekly classes

Weeks 1–10 cover using large language models (LLMs — ChatGPT, Claude, Gemini, Le Chat and others) to carry out tasks in data analysis: code and data documentation, writing reports, agentic work in the terminal, going from raw data to a constrained PDF, text-as-data on football interviews, and using AI as a research companion to think about controls and instruments.

Several case studies run across the weekly material — simulated hotel data from Austria, the World Values Survey, US CPS earnings, football manager post-match interviews. There are weekly practice assignments and a learning-more collection with blogs, papers, and video recommendations.

Content — Part 2: capstone

A three-session team project on manager changes in football. Each team picks a country, builds a 10+ season match-and-manager panel, scrapes news around each change, uses an LLM API to score how expected each change was, then runs a difference-in-differences analysis with heterogeneity by manager type, team, and expectation.

The capstone deliberately requires three things students typically have not done:

  1. Web scraping at project scale — RSS feeds, article pages, hundreds to thousands of items across sources and languages.
  2. Text-as-data with LLM APIs — designing a classification prompt, structured outputs, hand-validation against the model, cost and reproducibility.
  3. Modern causal econometrics — staggered DiD, event-study specifications, parallel-trends diagnostics, anticipation effects.
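To make the scraping step concrete, here is a minimal sketch of parsing an RSS feed with Python's standard library. The feed content is invented for illustration; in the capstone you would fetch real feeds over HTTP and loop over many sources.

```python
import xml.etree.ElementTree as ET

# A tiny inline RSS snippet standing in for a real news feed
# (in the capstone you would fetch this with urllib or requests).
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Football News</title>
    <item>
      <title>Club sacks manager after derby defeat</title>
      <link>https://example.com/a1</link>
      <pubDate>Sat, 01 Mar 2025 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>New coach appointed on two-year deal</title>
      <link>https://example.com/a2</link>
      <pubDate>Sun, 02 Mar 2025 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_feed(xml_text):
    """Return a list of dicts with title, link, and date for each <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
        })
    return items

articles = parse_feed(RSS)  # two items parsed from the snippet above
```

At project scale the same loop runs over hundreds of feed URLs, with deduplication and per-source error handling on top.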

The argument of the course is that AI is what makes this scope realistic in three sessions — and the capstone is where students prove that to themselves.

Background: data analysis / econometrics

You need a background in data analysis / econometrics; a good introductory course is enough. I, of course, suggest Chapters 1-12 and 19 of Data Analysis for Business, Economics and Policy (Cambridge UP, 2021). Full slideshows, data, and code are open source. But consider buying the book!🤝

In particular, the course builds on Chapters 1-6, 7-10, 19, and 22-24 of Data Analysis, but other introductory econometrics plus basic data-science knowledge is fine.

Background: coding

Students are expected to have some basic coding knowledge in Python or R (Stata also fine for the early weeks). The capstone is realistically Python or R only — scraping and API work in Stata is painful.

Relevance

AI is everywhere and has become essential: most analytic work will use it, much as most work came to rely on the Internet a while back. It does not solve every problem, but almost all intellectual tasks will draw on its input.

Learning Outcomes

By the end of the course, students will be able to

  • Use genAI fluently across the standard data-analysis stack — wrangling, description, reporting, light text analysis.
  • Work with agentic CLI tools (Claude Code and similar) on a real repo, with reproducibility and tests.
  • Scrape structured data from the web for a defined research goal.
  • Use an LLM API to turn text into numbers, including prompt design, structured output, and hand-validation.
  • Run a modern difference-in-differences / event-study analysis with staggered treatment and discuss its limits honestly.
  • Distinguish AI uses where the output is fine as-is from those that need strong human supervision.
  • Run a multi-week team project from data collection to causal analysis, using AI throughout.

Target audience

The course is aimed primarily at graduate and advanced-undergraduate students in economics, quantitative social science, political science, sociology, business analytics, and adjacent fields — students who have done a real econometrics or data-analysis course and now need to reset their practice around AI. The capstone in particular assumes you are willing to push past your current toolkit.

The material is also designed to be forkable by instructors: all open source under CC BY-NC-SA, so teachers can pick weeks, swap case studies, and run their own version. Practitioners, researchers, and journalists with the background can absolutely go through it solo.

Assignments

Assignments are available for all classes.

Important to note for assignments:

  • Use AI, but do not submit something that was created by AI. AI is your assistant.
  • One of the goals of the course is to practice this.

Weekly content (Part 1)

Tip

Before Week 1. Get your environment in shape and pick up a few prompting-for-code habits — see AI Coding Prep. Skip it if you already use Copilot/Cursor on real projects.

Week01: LLM Review

What are LLMs, and how does the magic happen? A brief, non-technical intro to working with LLMs, plus ideas for applications. Includes suggested readings, podcasts, and videos. See also which AI model to use.

Content

Week02: Data and code discovery and documentation with AI

Learn how to write clear, professional code and data documentation. LLMs are a great help once you know the basics.

Case study: World Values Survey

Content

Week 03: Writing Reports

You have your data and task, and need to write a short report. We compare different options with LLMs, from a one-shot prompt to iteration.

Case study: World Values Survey

Content

Week04: Agentic AI with Claude Code

From chat to terminal - introducing Claude Code for data analysis. Students learn to use agentic AI that works directly with files, generates data, and iterates on analysis.

Case study: Austrian Hotels

Content

Week05: Advanced CLI Workflows

Going deeper with CLI tools: custom skills, project-specific instructions (CLAUDE.md), git integration, and autonomous execution. Turning CLI tools from clever assistants into reproducible research companions.
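For instance, project-specific instructions live in a CLAUDE.md file at the repo root. The filename and mechanism are real Claude Code features; the contents below are an invented sketch of what a data-analysis project might put there.

```markdown
# Project instructions (hypothetical example)

- Data lives in `data/raw/`; never edit raw files, write cleaned output to `data/clean/`.
- All analysis scripts are in R; follow the tidyverse style guide.
- After changing any script, rerun the tests and report failures before committing.
- Commit messages: one line, imperative mood.
```

The point is that the agent reads this on every run, so conventions survive across sessions instead of being re-prompted each time.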

Content

Week06: From Data to Report

Download real CPS earnings data via CLI, contrast an undirected “vibe report” with a carefully directed economics-quality report. Iterative graph refinement, OLS regressions, and constrained PDF output.

Case study: US Earnings (CPS)

Content

Week07: Text as data 1 – intro lecture

No course of mine can escape football (soccer). Here we look at post-game interviews to learn the basics of text analysis and apply LLMs at what they do best: context-dependent learning. This is a two-class series; the first is an intro to natural language processing.

Case study: Football Manager Interviews

Content

Week08: Sentiment Analysis with AI

Second class: now we are in action. How do LLMs compare to humans?

Case study: Football Manager Interviews

Content

Week09: AI as research companion: Control variables

Content

Week10: AI as research companion: Instrumental variables

Content

Capstone (Part 2): three sessions — doing things you have never done

A three-session team project on manager changes in football, deliberately scoped so that each session forces students into territory they have not been in before. Pick a country, build a 10+ season panel, score the news, run the causal design. — Project description

Session 1 — Data collection & description. Build a multi-table match-and-manager panel from messy public sources. Real entity resolution, real data quality work. — Content

Session 2 — From text to expectations (web scraping + LLM APIs). Scrape news articles around each manager change, then use an LLM API to score how expected the change was. Prompt design, structured output, hand-validation, cost. — Content
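The hand-validation step in Session 2 can be sketched in a few lines. Everything below is an invented illustration: the JSON schema (`article_id`, `expected_score`, `reason`), the 1-5 expectation scale, and the labels are assumptions, not the course's actual code; only the parsing-and-agreement logic is the point.

```python
import json

# Hypothetical structured outputs from an LLM API: one JSON string per
# article, produced by a classification prompt asking "how expected was
# this manager change?" on a 1-5 scale. Schema and values are invented.
llm_raw = [
    '{"article_id": 1, "expected_score": 5, "reason": "long losing streak"}',
    '{"article_id": 2, "expected_score": 1, "reason": "surprise resignation"}',
    '{"article_id": 3, "expected_score": 4, "reason": "public board criticism"}',
]

# Hand labels on the same articles, coded by the team for validation.
hand_labels = {1: 5, 2: 2, 3: 4}

def agreement_rate(raw_outputs, labels, tolerance=1):
    """Share of articles where the model is within `tolerance` of the hand label."""
    hits = 0
    for line in raw_outputs:
        rec = json.loads(line)
        if abs(rec["expected_score"] - labels[rec["article_id"]]) <= tolerance:
            hits += 1
    return hits / len(raw_outputs)

rate = agreement_rate(llm_raw, hand_labels)  # all three within one point
```

Requesting JSON (rather than free text) is what makes this check cheap: a low agreement rate sends you back to the prompt, not to re-reading every article.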

Session 3 — Difference-in-Differences + final presentation. Modern causal econometrics: staggered DiD, event-study, parallel-trends diagnostics, heterogeneity by manager / team / expectation. Honest presentation of what the data can and cannot say. — Content
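Before any staggered-treatment machinery, the core DiD comparison is just two differences. The sketch below uses invented points-per-game means for illustration; the capstone's actual estimation runs on the full panel with event-time specifications.

```python
# Toy group means: a minimal 2x2 difference-in-differences.
# All numbers are invented for illustration.
means = {
    ("treated", "pre"): 1.20,   # teams that changed manager, before the change
    ("treated", "post"): 1.45,  # same teams, after the change
    ("control", "pre"): 1.30,   # comparison teams, same window
    ("control", "post"): 1.35,
}

def did(m):
    """DiD estimate: change for treated minus change for controls."""
    treated_change = m[("treated", "post")] - m[("treated", "pre")]
    control_change = m[("control", "post")] - m[("control", "pre")]
    return treated_change - control_change

effect = did(means)  # 0.25 - 0.05 = 0.20 points per game
```

The identifying assumption is that, absent the manager change, the treated teams would have followed the control change (parallel trends); the event-study diagnostics in this session are how the teams probe that assumption.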

Knowledge Base

Reusable reference pages on APIs, tools setup, project design, and more — see the Knowledge Base. For further reading beyond the course, check the beyond page.


Rights and acknowledgement

You can use it freely to teach and learn.

Attribution: Békés, Gábor: “Data Analysis with AI: a comprehensive course”, available at gabors-data-analysis.com/ai-course/, v2.2. 2026-05-01.

You can fork it from the Github Repo. github.com/gabors-data-analysis/da-w-ai/

License: CC BY-NC-SA 4.0 – share, attribute, non-commercial (contact me for corporate gigs)

Textbook: Please check out the textbook behind all this, and buy it if you can. If you are interested in teaching with it, contact Cambridge UP or me.

Thanks

Thanks: Developed mostly by me, Gábor Békés. Thanks a million to the two wonderful human RAs, Ms Zsuzsanna Vadle and Mr Kenneth Colombe, both PhD students. Also thanks to Adam Víg, long-term collaborator, now at Google. Thanks to Claude and ChatGPT for helping craft pages, improve consistency, and create the simulated dataset. They helped create the slideshows and educated me on a bunch of topics like reinforcement learning and NLP. This is a beautiful example of collaboration with great young people while heavily benefiting from advanced AI. Thanks to Quarto: it was all drafted and written in Quarto and RStudio by Posit. Plus all the love from GitHub.

Thanks to CEU’s teaching grant, which allowed me to pay people and AI.

Questions and suggestions

This material is based on my course at CEU in Vienna, Austria. Here is the GitHub repo.

If you have questions or suggestions, or are interested in learning more, just fill in this form.