Capstone Project — Session 1: Data Collection

Collect, combine and describe a match-and-manager dataset

Published

April 13, 2026

Data collection: building a match-and-manager dataset

Project overview & research question


Learning objectives

By the end of this session your team will:

  • Identify reliable data sources for your chosen country/league.
  • Collect 10+ years of match-level and manager-change data.
  • Structure and document the dataset with explicit quality checks.
  • Be honest about what is missing and how it limits the later analysis.

Required dataset components

📊 Core dataset components

You will need six components: four data tables, plus documentation and quality assurance. The details are up to you; here are some examples.

Component                  Details
Game results               Date, home team, away team, goals, league, season
Managers                   Date, team, manager
Manager characteristics    Age, experience, nationality, etc.
Team information           Team, season, location, quality, etc.
Data documentation         Schema, sources, number of observations, quality notes
Quality assurance          No missing critical fields, consistent dates, no duplicates, standardized team names
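The four quality-assurance checks in the last row could be sketched roughly as follows. This is a minimal illustration, not a required implementation: the column names (`home_goals`, `away_goals`, and so on), the ISO date format, and the choice of (date, home team, away team) as the primary key are all assumptions, and the team names in the usage note are made up.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical column names; adapt them to your own schema.
CRITICAL_FIELDS = ("date", "home_team", "away_team",
                   "home_goals", "away_goals", "season")


def check_matches(rows):
    """Return a list of human-readable problems found in a list of match dicts."""
    problems = []

    for i, row in enumerate(rows):
        # 1. No missing critical fields (0 goals is a valid value, not a gap).
        for field in CRITICAL_FIELDS:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")

        # 2. Consistent dates: assumed here to be ISO strings (YYYY-MM-DD).
        raw = row.get("date")
        if raw not in (None, ""):
            try:
                date.fromisoformat(raw)
            except (TypeError, ValueError):
                problems.append(f"row {i}: bad date {raw!r}")

    # 3. No duplicates on the assumed primary key.
    keys = Counter(
        (r.get("date"), r.get("home_team"), r.get("away_team")) for r in rows
    )
    for key, n in keys.items():
        if n > 1:
            problems.append(f"duplicate match {key} appears {n} times")

    # 4. Standardized team names: flag spellings that differ only by
    #    letter case or whitespace.
    variants = defaultdict(set)
    for r in rows:
        for field in ("home_team", "away_team"):
            name = r.get(field)
            if name:
                variants[" ".join(name.lower().split())].add(name)
    for names in variants.values():
        if len(names) > 1:
            problems.append(f"inconsistent team name: {sorted(names)}")

    return problems
```

Calling `check_matches(rows)` on rows read with `csv.DictReader` returns an empty list when all four checks pass. A dataset containing both "Alpha FC" and "alpha fc" (invented names) would be reported under check 4. The check only catches case and whitespace variants; abbreviations and renamed clubs still need a manual alias list.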

Work tasks (2-hour block)

🎯 What you’ll do

  1. Discuss in teams — Which country/league? What sources exist?
  2. Self-interview — have your AI ask you clarifying questions before it writes code. (See Designing Larger Analytics Projects.)
  3. Plan — a short written PLAN.md: sources, primary key, joins, tests, owner per task.
  4. Execute — scrape / download / compile from your sources.
  5. Test — all three kinds: data tests, data-describe, code tests. See the testing section for what each means in practice.
  6. Understand the data you have — Check coverage: How many manager changes are there? How complete is the data? What is missing for small clubs or early seasons?
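For the join in step 3 and the manager-change count in step 6, one possible approach is to treat each row of the managers table as the start of a spell and look up the most recent spell on or before each match date. The sketch below assumes that the managers table's `date` is an appointment date in ISO format and that a manager appointed on match day is already in charge; the function names are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def build_spells(appointments):
    """Group appointment rows by team, sorted by start date."""
    spells = defaultdict(list)
    for a in appointments:
        spells[a["team"]].append((date.fromisoformat(a["date"]), a["manager"]))
    for team in spells:
        spells[team].sort()
    return spells


def manager_on(spells, team, match_date):
    """Return the manager in charge of `team` on `match_date`, or None if unknown."""
    team_spells = spells.get(team, [])
    starts = [start for start, _ in team_spells]
    # Index of the last spell that started on or before the match date.
    idx = bisect_right(starts, match_date) - 1
    return team_spells[idx][1] if idx >= 0 else None


def count_changes(spells):
    """Manager changes per team, counted as appointments after the first one."""
    return {team: max(len(s) - 1, 0) for team, s in spells.items()}
```

A lookup that returns `None` means the match predates the first recorded appointment for that team, which is itself a coverage finding worth reporting. The sketch does not model vacancies or caretaker spells, and `count_changes` overcounts if the same manager is listed twice in a row, so both are worth a data test.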

Data descriptive statistics

You are expected to evaluate data quality and understand key characteristics of your dataset.

Create a few key exhibits (tables or figures) that describe the data you have.
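One such exhibit could be a per-season coverage table. The sketch below is only an illustration: it assumes the match rows already carry hypothetical `home_manager` and `away_manager` fields (empty or `None` when unknown), and the three statistics shown are one choice among many.

```python
from collections import defaultdict


def coverage_by_season(matches):
    """Per season: match count, distinct teams, share of matches with both managers known."""
    stats = defaultdict(lambda: {"matches": 0, "teams": set(), "with_managers": 0})
    for m in matches:
        s = stats[m["season"]]
        s["matches"] += 1
        s["teams"].update((m["home_team"], m["away_team"]))
        if m.get("home_manager") and m.get("away_manager"):
            s["with_managers"] += 1

    table = []
    for season in sorted(stats):
        s = stats[season]
        table.append({
            "season": season,
            "matches": s["matches"],
            "teams": len(s["teams"]),
            "manager_coverage": round(s["with_managers"] / s["matches"], 3),
        })
    return table


def format_table(table):
    """Render the coverage table as aligned plain text."""
    lines = [f"{'season':<10}{'matches':>8}{'teams':>7}{'mgr_cov':>9}"]
    for r in table:
        lines.append(
            f"{r['season']:<10}{r['matches']:>8}{r['teams']:>7}"
            f"{r['manager_coverage']:>9.1%}"
        )
    return "\n".join(lines)
```

Printing `format_table(coverage_by_season(matches))` gives one line per season. A drop in match counts or manager coverage in early seasons shows directly where the data is thin.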


Discussion (last hour)

  • Team work — How did you divide the work? What worked well? What was hard?
  • Data quality
  • Cross-country variation — How does availability change across leagues?
  • Tests — What failed? What do you plan?
  • Progress — How far along are you? What is left to do?
  • Next session readiness — API, clear UIDs.

Delivery (Session 1)

📦 What to hand in

  • Who: by group — you finish what you started in class together.
  • Deadline: Sunday 23:55 (the Sunday after this session).
  • Where: your team’s GitHub repo for the project (please keep it private for now and share it with us).
  • What:
    • Raw data (or a script that fetches it reproducibly) + processed dataset.
    • A short DATA.md describing each table, source, schema, and known issues.
    • Test files (data tests, data-describe checks, code tests).
    • Data description (in any format, e.g. a notebook or HTML) with key statistics and exhibits describing the data you have.
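As an illustration of what a code test in the test files might look like, the sketch below tests a small helper rather than the data itself. Everything here is hypothetical: the `standardize_team` helper, the alias map, and the team names are invented for the example, and the tests are written as plain `test_` functions on the assumption that a runner such as pytest collects them.

```python
# Hypothetical alias map: every known spelling points to one canonical name.
ALIASES = {
    "alpha": "Alpha FC",
    "alpha fc": "Alpha FC",
    "beta utd": "Beta United",
    "beta united": "Beta United",
}


def standardize_team(name):
    """Map a raw team name to its canonical form; fail loudly on unknown names."""
    key = " ".join(name.lower().split())
    if key not in ALIASES:
        raise KeyError(f"unmapped team name: {name!r}")
    return ALIASES[key]


def test_known_aliases_map_to_one_name():
    assert standardize_team("Alpha") == "Alpha FC"
    assert standardize_team("  alpha   FC ") == "Alpha FC"
    assert standardize_team("Beta Utd") == "Beta United"


def test_unknown_name_is_rejected():
    try:
        standardize_team("Gamma Rovers")
    except KeyError:
        return
    raise AssertionError("expected KeyError for an unmapped name")
```

Raising on an unmapped name, instead of passing it through unchanged, is a deliberate choice in this sketch: a new spelling in a later scrape then breaks the pipeline visibly rather than silently creating a second team.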