Capstone Project — Session 1: Data Collection

Collect, combine and describe a match-and-manager dataset

Published

June 1, 2026

Capstone Project — Session 1

Data collection: building a match-and-manager dataset

→ Project overview & research question

Before you come to class (30–60 min)

✅ Pre-class checklist

Team & country — form your 3-person team and pick your country/league.
Repo — create the team’s GitHub repo (private for now, shared with us).
Scout sources — find at least one viable source for match results and one for manager changes.
Review terms — primary key, schema (glossary).

Learning objectives

By the end of this session your team will:

Identify reliable data sources for your chosen country/league.
Collect 10+ years of match-level and manager-change data.
Structure and document the dataset with explicit quality checks.
Be honest about what is missing and how it limits the later analysis.

Required dataset components

📊 Core dataset components

You will need six components: four datasets plus documentation and quality assurance. The details are up to you; here are some examples.

Component	Details	☐
Game results	Date, home team, away team, goals, league, season	☐
Managers	Date, team, manager	☐
Manager characteristics	Age, experience, nationality, etc.	☐
Team information	Team, season, location, quality, etc.	☐
Data documentation	Schema, sources, number of observations, quality notes	☐
Quality assurance	No missing critical fields, consistent dates, no duplicates, standardized team names	☐
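The four quality-assurance checks in the last row can be sketched as a small data test. This is a minimal illustration, not a required implementation: the column names, the ISO `YYYY-MM-DD` date format, the `(date, home_team, away_team)` primary key, and the `known_teams` set are all assumptions to replace with your own schema.

```python
from collections import Counter
from datetime import date

# Hypothetical column names; adapt them to your own schema.
CRITICAL_FIELDS = ("date", "home_team", "away_team",
                   "home_goals", "away_goals", "season")


def check_matches(rows, known_teams):
    """Return a list of human-readable problems found in match rows.

    `rows` is a list of dicts, `known_teams` the set of standardized names.
    """
    problems = []

    # 1. No missing critical fields (0 goals is a valid value, "" is not).
    for i, row in enumerate(rows):
        for field in CRITICAL_FIELDS:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")

    # 2. Consistent dates: every non-empty date must parse as ISO format.
    for i, row in enumerate(rows):
        value = row.get("date")
        if value:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"row {i}: bad date {value!r}")

    # 3. No duplicates on the assumed primary key.
    keys = Counter(
        (r.get("date"), r.get("home_team"), r.get("away_team")) for r in rows
    )
    for key, count in keys.items():
        if count > 1:
            problems.append(f"duplicate key {key}: {count} rows")

    # 4. Standardized team names: every name must be in the agreed list.
    for i, row in enumerate(rows):
        for side in ("home_team", "away_team"):
            name = row.get(side)
            if name and name not in known_teams:
                problems.append(f"row {i}: unknown {side} {name!r}")

    return problems
```

An empty return value means the table passed all four checks; anything else is a list you can paste into the known-issues part of your documentation.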

Work tasks (2-hour block)

🎯 What you’ll do

Discuss in teams — Which country/league? What sources exist?
Self-interview — have your AI ask you clarifying questions before it writes code. (See Designing Larger Analytics Projects.)
Plan — a short written PLAN.md: sources, primary key, joins, tests, owner per task.
Execute — scrape / download / compile from your sources.
Test — all three kinds: data tests, data-describe, code tests. See the testing section for what each means in practice.
Understand the data you have — coverage: how many manager changes? how complete? what is missing for small clubs or early seasons?
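When planning joins, one likely case is attaching the manager in charge to each match. Because the managers table records change dates rather than one row per match, this is an as-of lookup: take the latest appointment on or before the match date. The sketch below assumes manager rows with `date` (ISO start date), `team`, and `manager` fields, and assumes a manager appointed on a match day is already in charge; both are illustrative choices, not project requirements.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def build_manager_index(manager_rows):
    """Group appointments by team, each list sorted by start date."""
    index = defaultdict(list)
    for row in manager_rows:
        start = date.fromisoformat(row["date"])
        index[row["team"]].append((start, row["manager"]))
    for spells in index.values():
        spells.sort()
    return index


def manager_on(index, team, match_date):
    """Return the manager in charge of `team` on `match_date`.

    Returns None when the team is unknown or the match predates
    the first recorded appointment (a coverage gap worth reporting).
    """
    spells = index.get(team, [])
    starts = [start for start, _ in spells]
    # Position of the first appointment strictly after the match date.
    pos = bisect_right(starts, match_date)
    return spells[pos - 1][1] if pos else None
```

Counting how often `manager_on` returns `None` gives a direct measure of how complete the manager data is for each team and season.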

Data descriptive statistics

You are expected to evaluate data quality and understand key characteristics of your dataset.

Create a few key exhibits (tables or figures) that describe the data you have.
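One possible exhibit is a coverage table with matches and manager changes per season, which makes thin early seasons visible at a glance. The sketch below assumes both tables carry a `season` field (a hypothetical column name) and uses only the standard library; a notebook with your preferred data library works just as well.

```python
from collections import Counter


def coverage_by_season(match_rows, change_rows):
    """Return (season, n_matches, n_manager_changes) tuples, sorted by season."""
    matches = Counter(row["season"] for row in match_rows)
    changes = Counter(row["season"] for row in change_rows)
    seasons = sorted(set(matches) | set(changes))
    # Counter returns 0 for seasons missing from one of the tables.
    return [(s, matches[s], changes[s]) for s in seasons]


def format_table(table):
    """Render the coverage table as fixed-width text."""
    lines = [f"{'season':<10}{'matches':>9}{'changes':>9}"]
    for season, n_matches, n_changes in table:
        lines.append(f"{season:<10}{n_matches:>9}{n_changes:>9}")
    return "\n".join(lines)
```

A season with many matches and zero recorded changes may be genuine stability or may be a gap in the manager source, so it is worth checking by hand.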

Discussion (last hour)

Team work — How did you divide the work? What worked well? What was hard?
Data quality
Cross-country variation — How does availability change across leagues?
Tests — What failed? What do you plan to do about it?
How far along – What is left to do?
Next session readiness — API, clear UIDs.

Delivery (Session 1)

📦 What to hand in

Who: by group — you finish what you started in class together.
Deadline: Sunday 23:55 (the Sunday after this session).
Where: your team’s GitHub repo for the project. (Please keep it private for now and share it with us.)
What:
- Raw data (or a script that fetches it reproducibly) + processed dataset.
- A short DATA.md describing each table, source, schema, and known issues.
- Test files (data tests, data-describe checks, code tests).
- Data description in any format (notebook or HTML) with key statistics and exhibits describing the data you have.
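As one reading of what a code test among the test files could look like, here is a sketch that tests a team-name normalizer. The alias map, the team names, and the function name are hypothetical, and the linked testing section remains the reference for what each kind of test should cover. The test functions follow the plain `test_*` naming convention, so a runner such as pytest would pick them up, but they also run as ordinary function calls.

```python
import re

# Hypothetical alias map: every known spelling points to one standard name.
ALIASES = {
    "alpha": "Alpha FC",
    "alpha fc": "Alpha FC",
    "fc alpha": "Alpha FC",
}


def normalize_team_name(raw):
    """Map a raw team name to its standard form, or raise ValueError."""
    # Collapse repeated whitespace, trim, and ignore case before lookup.
    key = re.sub(r"\s+", " ", raw).strip().lower()
    try:
        return ALIASES[key]
    except KeyError:
        raise ValueError(f"unknown team name: {raw!r}") from None


def test_known_spellings_map_to_one_name():
    for raw in ("Alpha", "alpha fc", "  FC   Alpha "):
        assert normalize_team_name(raw) == "Alpha FC"


def test_unknown_name_is_rejected():
    try:
        normalize_team_name("Gamma Town")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an unknown team name")
```

Raising on unknown names, rather than passing them through, means a new spelling in a later scrape fails loudly instead of silently creating a second team.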
