Data Analysis with AI
Developing new teaching material that shows how to use GenAI/LLM technologies to improve productivity in data analysis.
The project details
We are developing new course material, “Data Analysis with AI”, to teach students already versed in the core methods of data analysis how best to harness generative AI (GenAI) technologies to improve their productivity. We will focus on large language models (LLMs) such as ChatGPT or Claude.
Topics
The material is based on the textbook Békés-Kézdi: Data Analysis for Business, Economics, and Policy, Cambridge University Press, 2021. It draws on the variety of topics in the book and shows how to implement tasks with AI and how to enhance human analysts with AI.
The topics are limited to data analysis but span a wide range within it, such as:
- Data wrangling: Join two datasets
- Surveys: Create variable names from survey questions
- Analysis/modelling: Design and estimate a multivariate regression of the wage gap
- Data discovery: Do an exploratory analysis of an unknown dataset
- Causal inference: Event study design and estimation
- Causal inference: Design an RCT (A/B test of an ad), including sample size planning
- Machine learning: Build a predictive model for prices
- Machine learning: Create a sparse model in a high-dimensional dataset
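To make one of these tasks concrete, here is a minimal sketch of the sample size planning step for an A/B test of an ad. It assumes a two-sided test comparing two conversion rates and uses the standard normal-approximation formula. The function name, the default significance level and power, and the example rates are illustrative assumptions, not part of the course material.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate observations needed in each arm of a two-arm test.

    Uses the normal approximation for comparing two proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    if p_control == p_treatment:
        raise ValueError("The two rates must differ to plan a sample size.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical scenario: the current ad converts at 10%, and we want to
# detect an improvement to 12%.
n_per_arm = sample_size_per_arm(0.10, 0.12)
print(f"Observations needed per arm: {n_per_arm}")
```

A calculation like this gives the analyst an independent number to compare against whatever sample size an AI assistant proposes for the same design.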
Case studies
Each case study will consider a topic, a real-life analytical question, and a dataset, and will compare the output of the AI with that of an experienced analyst.
Take this example prompt:
Here’s a country-level dataset containing variables from the World Values Survey 7th wave, and the corresponding GDP per capita in 2019 obtained from the World Bank. I also attached the codebook for the WVS. PART I: If you were an economic analyst which 15 variables from the “Core Variables” would you investigate if they are related to GDP? Avoid sensitive topics. Include a wide variety of variables you think could be interesting to explore. Show your arguments why you chose them. PART II: Pick the four out this 15 variables that has the highest correlation with GDP. Create a graph (or a set of graphs) to show. Summarise your findings.
Here we’ll illustrate that:
- PART I is done well
- PART II is mostly okay, but picking the variables with the highest correlation does not always work
- The PART II graph needs further input; it then becomes acceptable, but never 100% right.
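The selection step in PART II can be checked by hand. Below is a minimal sketch of one way an analyst might do that with pandas. It ranks candidates by the absolute value of their Pearson correlation with GDP, so that strong negative relationships are not overlooked. The column names (`gdp_pc`, `var_a`, and so on) and the tiny table are made-up placeholders for illustration, not values from the World Values Survey or the World Bank.

```python
import pandas as pd


def top_correlated(df, target, candidates, k=4):
    """Return the k candidates with the largest absolute Pearson
    correlation with the target, keeping the sign of each correlation."""
    corr = df[candidates].corrwith(df[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(k)


# Toy data for illustration only (hypothetical column names and values).
toy = pd.DataFrame({
    "gdp_pc": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # moves one-for-one with the target
    "var_b": [5.0, 4.0, 3.0, 2.0, 1.0],  # moves exactly opposite to the target
    "var_c": [2.0, 1.0, 2.0, 1.0, 2.0],  # unrelated to the target
})

result = top_correlated(toy, "gdp_pc", ["var_a", "var_b", "var_c"], k=2)
print(result)
```

Ranking by the signed correlation instead would push `var_b` to the bottom even though it is as strongly related to the target as `var_a`. This is one of several things worth verifying in AI-generated output; the sketch does not claim to cover every way the selection can go wrong.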