Week 11: Zero to Hero Data Hack — Project Intro & Data Collection
Football managers: collect, combine and describe a match-and-manager dataset
The Project: Manager Impact in Football
For weeks 11–13 we leave the neat, pre-cleaned case studies behind and run a three-session data hack: one broad research question, messy real-world data, and AI as your primary teammate.
Research question: Does changing a manager improve team performance? And does the impact vary by manager and team characteristics?
Why this project
- It is specific (one domain, concrete data) but broad (no exact steps given to you).
- You will have to plan, layer tasks with AI, and decide — not just follow a recipe.
- Before you start, read the companion page Designing Larger Analytics Projects with AI — it covers the mindsets, self-interview, agents.md file, and the three kinds of tests you’ll use all three weeks.
Scope
| Dimension | Choice |
|---|---|
| Sport | Football (soccer) |
| League tier | First division only |
| Countries (pick one per team) | Spain, Italy, France, Germany, Turkey, Scotland, Portugal, Netherlands, Poland, Ukraine, Russia, or another you can defend |
| Time horizon | 10+ seasons of historical data |
| Weekly deliverables | Reproducible repo and READMEs |
| Final deliverable | Reproducible repo + ~12-minute presentation |
Key questions to answer by Week 13
- What is the average effect of a manager change on team performance?
- Which types of managers show larger or smaller effects?
- Which types of teams respond more?
- Do expectations from news match the actual performance change?
Sessions at a glance
| Week | Focus | Deliverable by Sunday 23:55 |
|---|---|---|
| 11 | Data collection, combination, description | Documented dataset + QA notes |
| 12 | Text → expectations (APIs, scraping) | Article corpus + per-change expectation score |
| 13 | DiD analysis + heterogeneity + presentation | Results, slides, repo |
Session Structure (each week)
Each 200-minute session follows the same structure:
- Intro talk (≈ 30 mins) — key concepts, common pitfalls, decisions to make.
- Team work (≈ 120 mins) — you execute; AI assists; I circulate.
- Group discussion (≈ 50 mins) — share what worked, compare approaches, debrief.
WEEK 11 — Data Collection & Description
Learning objectives
By the end of this week your team will:
- Identify reliable data sources for your chosen country/league.
- Collect 10+ years of match-level and manager-change data.
- Structure and document the dataset with explicit quality checks.
- Be honest about what is missing and how it limits the later analysis.
Required dataset components
📊 Core dataset components
| Component | Details | ☐ |
|---|---|---|
| Game results | Date, home team, away team, goals, gameweek, season | ☐ |
| Manager changes | Date of change, incoming manager, previous manager, team | ☐ |
| Manager characteristics | Age, experience in league, international experience, previous clubs | ☐ |
| Team information | Budget or squad value (if available), historical performance, league | ☐ |
| Data documentation | Schema, sources, number of observations, quality notes | ☐ |
| Quality assurance | No missing critical fields, consistent dates, no duplicates, standardized team names | ☐ |
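The quality-assurance row above can be turned into executable checks rather than a manual once-over. A minimal sketch with pandas, assuming a match table with the game-results columns listed in the table (column names are illustrative, not required):

```python
import pandas as pd

def run_qa(matches: pd.DataFrame) -> list[str]:
    """Return a list of QA failures; an empty list means the table passes."""
    problems = []
    # Critical fields must never be missing
    critical = ["date", "home_team", "away_team", "home_goals", "away_goals", "season"]
    with_na = [c for c in critical if matches[c].isna().any()]
    if with_na:
        problems.append(f"missing values in: {with_na}")
    # Dates must parse consistently
    if pd.to_datetime(matches["date"], errors="coerce").isna().any():
        problems.append("unparseable dates")
    # No duplicate fixtures (same two teams on the same date)
    if matches.duplicated(subset=["date", "home_team", "away_team"]).any():
        problems.append("duplicate fixtures")
    return problems
```

Run it after every collection step; a failure message tells you which checkbox in the table is not actually ticked.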
Work tasks (2-hour block)
🎯 What you’ll do
- Discuss in teams — Which country/league? What sources exist? What is realistic in 2 hours?
- Self-interview — have your AI ask you clarifying questions before it writes code. (See Designing Larger Analytics Projects.)
- Plan — a short written PLAN.md: sources, primary key, joins, tests, owner per task.
- Execute — scrape / download / compile from your sources.
- Test — all three kinds: data tests, data-describe, code tests. See the testing section for what each means in practice.
- Understand the data you have — coverage: how many manager changes? how complete? what is missing for small clubs or early seasons?
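The "understand the data" step can itself be code. A sketch of a coverage summary, assuming a manager-change table with `season`, `team`, and `change_date` columns (names are illustrative):

```python
import pandas as pd

def coverage_summary(changes: pd.DataFrame) -> pd.DataFrame:
    """Manager changes per season, so thin seasons and gaps jump out."""
    return (changes
            .groupby("season")
            .agg(n_changes=("change_date", "count"),
                 n_teams=("team", "nunique"))
            .reset_index())
```

A season with suspiciously few changes usually means a source gap, not a calm season — that belongs in your quality notes.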
Data-source starting points
- Match results & squads: FBref (Sports Reference), Understat, Transfermarkt, Flashscore, Wikipedia per-season pages.
- Manager history: Transfermarkt manager pages, Wikipedia manager tenure tables, league official sites.
- Aggregators with APIs: Football-Data.co.uk (CSV dumps per season), StatsBomb Open Data (select leagues).
You are expected to evaluate each source (reliability, coverage, licence, cost) before committing.
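As a concrete example, Football-Data.co.uk publishes one results CSV per league per season. A hedged sketch of a downloader — the URL pattern and league codes (e.g. SP1 for the Spanish first division) are assumptions you should verify on the site before relying on them:

```python
from io import StringIO
from urllib.request import urlopen

import pandas as pd

# Assumed URL pattern -- check the actual links on football-data.co.uk
BASE = "https://www.football-data.co.uk/mmz4281"

def season_url(start_year: int, league_code: str) -> str:
    """Build the per-season CSV URL, e.g. (2023, 'SP1') -> .../2324/SP1.csv."""
    tag = f"{start_year % 100:02d}{(start_year + 1) % 100:02d}"
    return f"{BASE}/{tag}/{league_code}.csv"

def fetch_season(start_year: int, league_code: str) -> pd.DataFrame:
    """Download one season of results into a DataFrame."""
    with urlopen(season_url(start_year, league_code), timeout=30) as resp:
        return pd.read_csv(StringIO(resp.read().decode("utf-8", errors="replace")))
```

Save the raw CSVs to disk as well, so the collection step stays reproducible even if the site changes.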
Discussion (last hour)
- Data quality — What is missing? Any outliers? What does your sample not cover?
- Cross-country variation — How does availability change across leagues?
- Tests — What failed? What did the failure reveal about your assumptions?
- Code quality — Functions vs scripts; what did you mock; how reproducible is the collection step?
- Next week readiness — Is the dataset stable enough to join news onto it?
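One way to make "what did you mock" concrete: split the network fetch from the parsing, and let tests inject a canned page instead of hitting the live site. A minimal sketch with a toy regex parser (real scraping code would use an HTML parser such as BeautifulSoup; all names here are illustrative):

```python
import re
from urllib.request import urlopen

def fetch_page(url: str) -> str:
    """The only function that touches the network."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse_managers(html: str) -> list[str]:
    """Toy parser: pull manager names out of <li> items."""
    return re.findall(r"<li>(.*?)</li>", html)

def collect_managers(url: str, fetch=fetch_page) -> list[str]:
    """Tests pass fetch=lambda url: canned_html and never hit the network."""
    return parse_managers(fetch(url))
```

With this split, your code tests exercise `parse_managers` on saved samples, which keeps them fast and deterministic.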
Delivery (Week 11)
📦 What to hand in
- Who: by group — you finish what you started in class together.
- Deadline: Sunday 23:55 (the Sunday after this session).
- Where: your team’s GitHub repo for the hack.
- What:
- Raw data (or a script that fetches it reproducibly) + processed dataset.
- A short DATA.md describing each table, source, schema, and known issues.
- Test files (data tests, data-describe checks, code tests).
- A one-page PLAN_NEXT.md: what you will need next week to join news on top.
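A possible skeleton for DATA.md — section and field names are suggestions, not requirements:

```markdown
# DATA.md

## matches.csv
- Source: <URL> (accessed <date>)
- Grain: one row per fixture; N = <count>
- Schema: date, home_team, away_team, home_goals, away_goals, gameweek, season
- Known issues: <e.g. a club renamed mid-period; gameweek missing for one season>

## manager_changes.csv
- Source: <URL> (accessed <date>)
- Grain: one row per managerial change; N = <count>
- Schema: team, change_date, previous_manager, incoming_manager
- Known issues: <e.g. caretaker spells hard to classify>
```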