Capstone Project — Session 2: From Text to Expectations

Collect news, call an AI API, turn articles into a per-change expectation score

Published: April 27, 2026



Where we are

  • Session 1 → you have a cleaned match + manager-change dataset.
  • Session 2 (today) → for each manager change, you build a news-based expectation signal: was the incoming manager received positively or negatively before the performance shows up on the pitch?
  • Session 3 → you use that signal as a moderator in the DiD analysis.

The point is not to teach NLP from scratch — we did that in Week 07. The point is to scale it with APIs and plug the output back into your panel.


Learning objectives

By the end of this session your team will:

  • Identify relevant news sources for your country/league.
  • Scrape or pull articles into a structured table (one row per article).
  • Use an LLM API to classify each article into an expectation score.
  • Aggregate back to one row per manager change, ready for DiD.

Intro talk — quick version

🎞️ Slideshow — key concepts only

We’ll fly through the Text to Data slideshow, covering just the “how to turn text into structured signal” parts.

We skip the sentiment-scale exercises from Weeks 07–08 — you already did them. What matters here is:

  • What is a good classification target for this project?
  • How do you write a prompt that returns a structured, parseable answer?
  • How do you budget time and money when you have ~1,000 articles to classify?
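Budgeting is easiest as a back-of-envelope calculation before you start. The sketch below is a minimal helper; every number in it (tokens per article, per-million-token prices) is an assumption — substitute the actual pricing of the model you choose.

```python
# Back-of-envelope API budget for a batch classification run.
# All defaults are ASSUMPTIONS -- replace them with the real token
# counts and per-million-token prices of the model you pick.

def estimate_cost(n_articles, tokens_per_article=800, tokens_per_reply=80,
                  usd_per_1m_input=0.50, usd_per_1m_output=1.50):
    """Rough total cost in USD: input tokens + output tokens."""
    input_cost = n_articles * tokens_per_article * usd_per_1m_input / 1_000_000
    output_cost = n_articles * tokens_per_reply * usd_per_1m_output / 1_000_000
    return input_cost + output_cost

print(f"~1,000 articles: ${estimate_cost(1_000):.2f}")  # ~$0.52 at these assumed rates
```

If the estimate comes out well under your ~$5 budget, you have headroom for re-runs after prompt changes — which you will need.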

APIs: what you need this session

APIs come up twice in this session: news/data APIs (to get articles) and LLM APIs (to classify them). You already saw these in Week 08; the key references are in the Knowledge Base.

🔑 Getting set up

  • Get an API key (OpenAI, Anthropic, or similar): How to get AI API keys.
  • Budget ~$5 — that is more than enough for a project this size.
  • Store the key outside the repo. Use an environment variable (.env, export) — never commit it.
  • Test first. Run your classification prompt on 10–20 articles before you unleash it on 1,000.
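A minimal sketch of the key-handling pattern from the checklist above — read the key from an environment variable and fail fast with a clear message, instead of hard-coding it and leaking it into the repo. The variable name `OPENAI_API_KEY` is just an example; use whatever your provider expects.

```python
import os

def get_api_key(var="OPENAI_API_KEY"):
    """Read the API key from the environment -- never from the repo.

    Failing fast here gives a clear error instead of a cryptic 401
    halfway through a 1,000-article run.
    """
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set {var} (e.g. in your .env) before running the pipeline.")
    return key
```

If you keep the key in a `.env` file, load it with `python-dotenv` before calling this — and make sure `.env` is in `.gitignore`.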

📚 Reference reading

The reference readings have been pulled into our knowledge base so you can find them after the course.

If you are new to APIs, spend 15 minutes on “Introduction to APIs” before you start your work block — it will save you an hour of confused debugging.


Work tasks (2-hour block)

🔍 News collection & classification pipeline

1. Find news sources

  • Aim for ≥ 2 credible outlets per country.
  • Prefer sources with RSS or a simple URL pattern — you’ll thank yourself tomorrow.
  • Mix national and league-specific / tabloid and broadsheet: drama lives in tabloids, context lives in broadsheets.

2. Build URL/article lists

  • Google News RSS (with a manager name in the query) is the fastest start.
  • If the source has team-specific RSS feeds, prefer those.
  • Expected scale: ~20 teams × 2 sources ≈ 1,000–1,600 articles / season.
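The Google News RSS start can be sketched as a small URL builder. The `news.google.com/rss/search` endpoint is the standard public pattern, but the language/country parameters below are assumptions — adjust them for your league's country.

```python
from urllib.parse import quote_plus

def google_news_rss_url(query, lang="en", country="US"):
    """Build a Google News RSS search URL for a manager-change query.

    Assumes the public news.google.com/rss/search pattern; set lang and
    country per league (e.g. "nl"/"NL" for the Eredivisie).
    """
    return ("https://news.google.com/rss/search?q=" + quote_plus(query)
            + f"&hl={lang}&gl={country}&ceid={country}:{lang}")
```

Feed the resulting URL to `feedparser` (or plain `requests` + XML parsing) to get one entry per article, with title, link, and publication date.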

3. Scrape clean text

  • Title + lead + full body. Keep source + pubDate + URL for reproducibility.
  • Store in a single tidy table: one row per article.
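The "one row per article" shape can be sketched as a function that turns one fetched page into one dict, keeping the provenance fields listed above. The parsing here (first `<h1>` as title, all `<p>` tags as body) is a deliberately naive assumption — inspect each outlet's markup and tighten the selectors per source.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def article_row(html, url, source, pub_date):
    """One article page -> one tidy row (dict).

    Naive extraction (first h1 = title, all p tags = body) as a
    starting point; real outlets need per-source selectors.
    """
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")
    body = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return {
        "url": url,
        "source": source,
        "pub_date": pub_date,
        "title": title.get_text(strip=True) if title else None,
        "text": body,
    }
```

A list of these dicts drops straight into `pandas.DataFrame` for the single tidy table.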

4. Classify via LLM API

  • Categories: Positive / Neutral / Negative expectations (keep it simple).
  • Prompt design — the hard part:
    • Instruct the model to ignore match-result reporting (results are already in your panel).
    • Ask for structured output (JSON with score, confidence, reason).
    • Return null for articles that are not about the change.
  • Validate on 20 hand-labelled articles before scaling.
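The structured-output step above can be sketched as a prompt plus a defensive parser. The prompt wording and JSON schema (`score`, `confidence`, `reason`) follow the bullets but are otherwise an assumption; the parser makes sure one malformed reply degrades to a `None` score instead of crashing a 1,000-article run. The actual API call (OpenAI, Anthropic, …) is omitted — slot it in where the comment indicates.

```python
import json

PROMPT = """You classify football news about an incoming manager.
Ignore match-result reporting; results are already in our panel.
If the article is not about the manager change, use null for score.
Answer ONLY with JSON:
{"score": "positive"|"neutral"|"negative"|null,
 "confidence": <0-1>, "reason": "<one sentence>"}"""

VALID_SCORES = ("positive", "neutral", "negative", None)

def parse_classification(reply_text):
    """Parse the model's JSON reply; any malformed or off-schema
    answer becomes a None score rather than an exception."""
    # reply_text would come from your provider's chat/completions call,
    # e.g. response content for PROMPT + the article text.
    try:
        data = json.loads(reply_text)
        if data.get("score") in VALID_SCORES:
            return data
    except (json.JSONDecodeError, AttributeError):
        pass
    return {"score": None, "confidence": 0.0, "reason": "unparseable reply"}
```

Log the unparseable cases: a high rate usually means the prompt, not the articles, needs fixing.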

5. Aggregate to one row per manager change

  • Window: 1 month before → 3 months after the change.
  • Aggregator: mean, max, or count of each category. Document the choice.
  • Output: a small table keyed on (team, change_date) ready to merge into the Session 3 DiD.
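The aggregation step can be sketched with a pandas `groupby`. The aggregator here (mean of a +1/0/−1 signed score, plus an article count) is one of the options named above, chosen for illustration — whatever your team picks, document it.

```python
import pandas as pd

def aggregate_expectations(articles):
    """Collapse per-article scores to one row per (team, change_date).

    Maps positive/neutral/negative to +1/0/-1, drops unscored rows
    (score None), and takes the mean -- one possible aggregator.
    """
    df = pd.DataFrame(articles)
    df["signed"] = df["score"].map({"positive": 1, "neutral": 0, "negative": -1})
    return (df.dropna(subset=["signed"])
              .groupby(["team", "change_date"], as_index=False)
              .agg(n_articles=("signed", "size"),
                   expectation_score=("signed", "mean")))
```

The result is exactly the small table keyed on `(team, change_date)` that Session 3 merges into the panel.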

Scale & reality check

📐 Plan before you scrape

Dimension                      Example
Seasons                        10
Manager changes per season     ~30
Articles per change            ~5
Total articles to classify     ~1,500
Cost at typical pricing        a couple of dollars

This is a lot of work for a 2-hour block. Scope down on purpose:

  • Start with 1 season, 1 source and a 3-month window.
  • Get the full pipeline working end to end.
  • Only then expand.

Discussion (last hour)

  • Coverage — Which manager changes got lots of press? Which got none? What does silence mean here?
  • Classification quality — Where did the model disagree with you on hand-labelled cases? Why?
  • Prompt engineering — Share one prompt change that noticeably improved output.
  • Cost & reproducibility — How many articles did you classify? What did it cost? Could somebody else rerun it from your repo?
  • Into Session 3 — Is your (team, change_date) → expectation_score table ready to merge into the panel?

Delivery (Session 2)

📦 What to hand in

  • Who: by group — continue with your Session 1 team and repo; finish what you started in the session.
  • Deadline: Sunday 23:55 (the Sunday after this session).
  • What:
    • Article table (one row per article) + classification table (one row per article with score + reason).
    • Aggregated expectations.csv: one row per (team, change_date).
    • Prompt file (prompt_classifier.md) + a short note on validation (n hand-labelled, agreement rate).
    • Updated DATA.md with the new tables documented.