Capstone Project — Session 1: Data Collection
Collect, combine and describe a match-and-manager dataset
Learning objectives
By the end of this session your team will:
- Identify reliable data sources for your chosen country/league.
- Collect 10+ years of match-level and manager-change data.
- Structure and document the dataset with explicit quality checks.
- Be honest about what is missing and how it limits the later analysis.
Required dataset components
📊 Core dataset components
You will need six components. The details are up to you; here are some examples.
| Component | Details | ☐ |
|---|---|---|
| Game results | Date, home team, away team, goals, league, season | ☐ |
| Managers | Date, team, manager | ☐ |
| Manager characteristics | Age, experience, nationality, etc. | ☐ |
| Team information | Team, season, location, quality, etc. | ☐ |
| Data documentation | Schema, sources, number of observations, quality notes | ☐ |
| Quality assurance | No missing critical fields, consistent dates, no duplicates, standardized team names | ☐ |
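The quality-assurance row can be turned into an automated check. The following is a minimal sketch using only the Python standard library. It assumes match rows are dictionaries with ISO dates (`YYYY-MM-DD`). The column names, the alias map, and the function names are hypothetical, so adapt them to your own schema.

```python
from collections import Counter
from datetime import date

# Hypothetical column names; adapt them to your own schema.
CRITICAL_FIELDS = ("date", "home_team", "away_team", "home_goals", "away_goals")

# Hypothetical alias map: every spelling you have seen -> one canonical name.
TEAM_ALIASES = {"Man Utd": "Manchester United", "Man United": "Manchester United"}


def standardize_team(name):
    """Collapse extra whitespace and map known aliases to one canonical name."""
    cleaned = " ".join(name.split())
    return TEAM_ALIASES.get(cleaned, cleaned)


def check_matches(rows):
    """Return a list of human-readable problems found in match rows (dicts)."""
    problems = []
    keys = Counter()
    for i, row in enumerate(rows):
        # 1. No missing critical fields (a score of 0 is a valid value).
        missing = [f for f in CRITICAL_FIELDS if row.get(f) in (None, "")]
        if missing:
            problems.append(f"row {i}: missing {', '.join(missing)}")
            continue
        # 2. Consistent dates: every date must parse as an ISO date.
        try:
            date.fromisoformat(row["date"])
        except ValueError:
            problems.append(f"row {i}: bad date {row['date']!r}")
        # 3. Count candidate primary keys using standardized team names.
        key = (row["date"],
               standardize_team(row["home_team"]),
               standardize_team(row["away_team"]))
        keys[key] += 1
    # 4. No duplicates: each (date, home, away) key should appear once.
    for key, n in keys.items():
        if n > 1:
            problems.append(f"duplicate match {key}: {n} rows")
    return problems
```

An empty list means the rows passed all four checks. Standardizing team names before counting keys matters: the same match recorded under two spellings would otherwise not be flagged as a duplicate.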
Work tasks (2-hour block)
🎯 What you’ll do
- Discuss in teams — Which country/league? What sources exist?
- Self-interview — have your AI ask you clarifying questions before it writes code. (See Designing Larger Analytics Projects.)
- Plan — a short written `PLAN.md`: sources, primary key, joins, tests, owner per task.
- Execute — scrape / download / compile from your sources.
- Test — all three kinds: data tests, data-describe, code tests. See the testing section for what each means in practice.
- Understand the data you have — coverage: How many manager changes? How complete is the data? What is missing for small clubs or early seasons?
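One join the plan has to settle is how to attach a manager to each match. Below is a minimal sketch of an "as-of" lookup, assuming the managers table stores one row per appointment as `(start_date, team, manager)` and that a match played on the appointment day counts for the new manager. Both are assumptions, not requirements of the project, and the function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def build_spells(manager_rows):
    """Group (start_date, team, manager) rows into per-team lists sorted by start date."""
    spells = defaultdict(list)
    for start, team, manager in manager_rows:
        spells[team].append((start, manager))
    for team in spells:
        spells[team].sort()
    return spells


def manager_on(spells, team, match_date):
    """Return the manager in charge of `team` on `match_date`, or None if unknown."""
    team_spells = spells.get(team, [])
    starts = [start for start, _ in team_spells]
    # Number of spells that started on or before the match date.
    idx = bisect_right(starts, match_date)
    return team_spells[idx - 1][1] if idx else None
```

A possible usage, with made-up names:

```python
spells = build_spells([
    (date(2015, 7, 1), "Team A", "Manager X"),
    (date(2016, 1, 10), "Team A", "Manager Y"),
])
manager_on(spells, "Team A", date(2015, 12, 31))  # "Manager X"
manager_on(spells, "Team A", date(2015, 1, 1))    # None: before the first known spell
```

Returning `None` instead of guessing keeps gaps visible, which helps when you later report what is missing for small clubs or early seasons.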
Data descriptive statistics
You are expected to evaluate data quality and understand key characteristics of your dataset.
Create a few key exhibits (tables or figures) that describe the data you have.
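Two numbers that could feed such exhibits are the count of manager changes per season and the share of matches with a known manager. The sketch below assumes manager rows arrive as `(season, team, manager)` tuples already sorted by date, and that match rows are dictionaries. The field and function names are hypothetical.

```python
from collections import Counter


def changes_per_season(manager_rows):
    """Count manager changes per season from (season, team, manager) rows in date order.

    A change is counted whenever a team's manager differs from that team's
    previous row; the first manager seen for a team is not a change.
    """
    last = {}
    changes = Counter()
    for season, team, manager in manager_rows:
        if team in last and last[team] != manager:
            changes[season] += 1
        last[team] = manager
    return dict(changes)


def coverage(matches, field):
    """Share of match rows (dicts) where `field` is present and non-empty."""
    if not matches:
        return 0.0
    filled = sum(1 for m in matches if m.get(field) not in (None, ""))
    return filled / len(matches)
```

Reporting coverage by season or by club size, rather than as a single overall number, makes it easier to see where the data thins out.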
Discussion (last hour)
- Team work — How did you divide the work? What worked well? What was hard?
- Data quality
- Cross-country variation — How does availability change across leagues?
- Tests — What failed? What do you plan to do about it?
- Progress – How far along are you? What is left to do?
- Next session readiness — API, clear UIDs.
Delivery (Session 1)
📦 What to hand in
- Who: by group — you finish what you started in class together.
- Deadline: Sunday 23:55 (the Sunday after this session).
- Where: your team’s GitHub repo for the project. Please keep it private for now and share it with us.
- What:
  - Raw data (or a script that fetches it reproducibly) plus the processed dataset.
  - A short `DATA.md` describing each table, source, schema, and known issues.
  - Test files (data tests, data-describe checks, code tests).
  - A data description in any format (notebook or HTML) with key statistics and exhibits describing the data you have.