Week 5 — Text as Data

Turning text into tabular data: humans vs AI, and scaling with APIs

Published: June 1, 2026

Turning a series of short texts into tabular data — lexicons, ML, and LLMs via API — and the ground-truth problem


The second research-focused week. You turn unstructured text into numbers you can analyse, compare how you score text against how AI scores it, and build a small pipeline that scales from 5 texts to hundreds via an API. Running case: post-match football manager interviews.


Before you come to class (30–60 min)

Pre-class checklist


Learning objectives

By the end of this unit you will be able to:

  • Explain the classical NLP pipeline (tokenization → BoW → TF-IDF) and where LLMs fit.
  • Compare the four ways to score sentiment: pre-built lexicons, custom lexicons, ML models, LLMs.
  • Score text by hand and by AI, and reason about agreement and ground truth.
  • Build a small text→data pipeline that calls an LLM API at scale, with a reusable prompt.

Session shape (200 min · 50·100·50)

| Block | Focus | Mode |
| --- | --- | --- |
| Intro (50) | NLP pipeline; four approaches; humans vs AI | Talk + hand-rating |
| Task (100) | Build a text→data pipeline, classify via API | Individual / pairs |
| Discussion (50) | Validation, ground truth, reproducibility | Group |

Intro (50 min)

📖 From bag-of-words to LLMs

Slideshow: Text to Data

  • Classical pipeline — tokenization, preprocessing (stemming, stop words), bag-of-words, TF-IDF.
  • Why domain knowledge matters — “clinical” is a compliment in football; a generic lexicon misses that.
  • Sentiment — detecting tone, not just which words appear.
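As a rough illustration of the classical pipeline, the sketch below tokenizes three invented statements, drops a small hand-picked stop-word list, builds bag-of-words counts, and weights them with one common TF-IDF variant (term share times log inverse document frequency). The texts, the stop-word list, and the function names are assumptions made for this example, not part of the course materials; stemming is left out to keep it short.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; the real texts come from the interview dataset.
docs = [
    "We were clinical in front of goal today.",
    "We were poor today and the goal we conceded was avoidable.",
    "The players gave everything and we deserved the win.",
]

# Illustrative stop-word list only; real lists are much longer.
STOP_WORDS = {"we", "were", "the", "and", "in", "of", "was", "a"}

def tokenize(text):
    """Lowercase, keep runs of letters, drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

# Bag-of-words: one term-count mapping per document.
bows = [Counter(tokenize(d)) for d in docs]

# Document frequency: in how many documents does each term appear?
df = Counter(term for bow in bows for term in bow)

def tfidf(bow, n_docs=len(docs)):
    """Weight each term by its share of the document times log(N / df)."""
    total = sum(bow.values())
    return {t: (c / total) * math.log(n_docs / df[t]) for t, c in bow.items()}

weights = [tfidf(b) for b in bows]

# Terms in the first statement, most distinctive first.
print(sorted(weights[0], key=weights[0].get, reverse=True))
```

In the first statement, "clinical" and "front" outrank "goal" and "today" because the latter two also appear in the second statement. That is the sense in which TF-IDF rewards distinctive words, and it also shows the limit: the weights say nothing about whether "clinical" is praise.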

Four ways to score sentiment (trade-offs matter):

| Criterion | Pre-built lexicon | Custom lexicon | ML model | LLM (API) |
| --- | --- | --- | --- | --- |
| Setup | Low | Medium | High | Low |
| Domain fit | Low | High | Med–High | High |
| Context | Medium | Medium | High | Very high |
| Transparency | High | High | Low | Medium |
| Scalability | High | Medium | High | Low–Medium |
| Cost | Free | Dev time | Compute | API tokens |

Examples: VADER/TextBlob (pre-built); an AI-built football lexicon (custom); a fine-tuned BERT (ML); GPT/Claude as the scorer (LLM).
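To make the pre-built vs custom lexicon contrast concrete, here is a minimal sketch of lexicon scoring. Both word lists are invented for illustration and are not the actual VADER or TextBlob lexicons; the scoring rule (average of matched word scores) is also just one simple choice.

```python
import re

# Made-up lexicons for illustration; real ones hold thousands of entries.
GENERIC = {"great": 1.0, "poor": -1.0, "disappointed": -1.0}
FOOTBALL = {**GENERIC, "clinical": 1.5, "sloppy": -1.0}

def lexicon_score(text, lexicon):
    """Average the scores of words found in the lexicon; 0.0 if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(hits) if hits else 0.0

statement = "We were clinical in front of goal."
print(lexicon_score(statement, GENERIC))   # no word matches the generic list
print(lexicon_score(statement, FOOTBALL))  # the domain list picks up "clinical"
```

The scoring is fully transparent, since every score traces back to listed words, but it only knows what the lexicon knows and ignores context such as negation.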

Hand-rate first (15 min)

Before any AI: rate the 5 manager statements yourself on the −2 … +2 scale. Then have an AI rate the same 5 — simple prompt vs detailed prompt — and compare its reasoning to yours. Look at the football-specific lexicon to see how domain knowledge changes interpretation.
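One possible way to set up the simple vs detailed comparison is to keep both prompts as small template functions, so the only thing that changes between runs is the instructions. The wording of both prompts below is an assumption made for this sketch, not the course's official prompt.

```python
# Hypothetical wording for the -2 ... +2 scale used in class.
SCALE = "-2 (very negative), -1, 0 (neutral), +1, +2 (very positive)"

def simple_prompt(text):
    """Bare instruction: no domain context, no output format."""
    return f"Rate the sentiment of this statement from -2 to +2:\n\n{text}"

def detailed_prompt(text):
    """Adds the role, the scale, a domain hint, and a required answer shape."""
    return (
        "You are rating post-match football manager statements.\n"
        f"Scale: {SCALE}.\n"
        "Football usage matters: 'clinical' is praise.\n"
        "Give the score first, then one sentence of reasoning.\n\n"
        f"Statement: {text}"
    )

statement = "We were clinical in front of goal."
print(simple_prompt(statement))
print(detailed_prompt(statement))
```

Pasting both outputs into the same chat model and comparing the answers to your own rating shows how much of the disagreement comes from the prompt rather than the model.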


Task block (100 min · individual or pairs)

Scale up from 5 texts to a real pipeline. Start small, validate, then expand.

🎯 Build a text → data pipeline

  1. Get the corpus. Use the combined interview dataset (text_id level) from the interviews case study. One row per text.
  2. Design the prompt. Decide the scale/labels, how to handle irrelevant text, and an output schema (e.g. score, reason, confidence). Prefer structured output over free text. Sketch in sentiment guidelines.
  3. Validate before scaling. Hand-label ~20 texts, run the prompt, compare. Aim for ≥ 80% agreement; if below, revise the prompt before going further.
  4. Classify via API. Call an LLM API over the full set. Reference: Calling LLM APIs from Python, and the case-study Python / R implementations.
  5. Compare methods. Put human average, domain-lexicon score, and AI average side by side. Where do they disagree most?
  6. Stretch. Ask the AI to predict the match result (win/draw/loss) or speaker gender from text alone — and discuss what linguistic cues it’s using and whether they’re reliable.
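Steps 2 to 4 can be sketched as follows, under stated assumptions: the model is asked to reply with a JSON object containing `score`, `reason`, and `confidence`; `call_llm` is a placeholder for whatever API client you use (no real provider API is shown here); and agreement is measured as exact match between human and AI scores. The stub `fake_llm` and the three texts are invented so the sketch runs offline.

```python
import json

LABELS = {-2, -1, 0, 1, 2}

def parse_reply(reply):
    """Parse one JSON reply and check it against the assumed schema."""
    data = json.loads(reply)
    score = int(data["score"])
    if score not in LABELS:
        raise ValueError(f"score out of range: {score}")
    return {
        "score": score,
        "reason": str(data.get("reason", "")),
        "confidence": float(data.get("confidence", 0.0)),
    }

def classify(texts, call_llm):
    """Score every text; call_llm maps a text to a JSON string."""
    rows = []
    for text_id, text in texts.items():
        try:
            rows.append({"text_id": text_id, **parse_reply(call_llm(text))})
        except (ValueError, KeyError, TypeError) as err:
            # Keep the row so failures stay visible in the scored table.
            rows.append({"text_id": text_id, "score": None,
                         "reason": f"parse error: {err}", "confidence": 0.0})
    return rows

def agreement_rate(human, rows):
    """Share of hand-labelled texts where the AI score equals the human score."""
    ai = {r["text_id"]: r["score"] for r in rows}
    shared = [t for t in human if t in ai]
    if not shared:
        raise ValueError("no overlapping text_ids")
    return sum(human[t] == ai[t] for t in shared) / len(shared)

def fake_llm(text):
    """Stand-in for a real API call (assumption, for offline testing only)."""
    score = 2 if "clinical" in text else -1
    return json.dumps({"score": score, "reason": "stub", "confidence": 0.5})

texts = {"t1": "We were clinical today.", "t2": "We were sloppy.", "t3": "A fair result."}
human = {"t1": 2, "t2": -1, "t3": 0}
rows = classify(texts, fake_llm)
print(agreement_rate(human, rows))
```

Passing the API call in as a function keeps the validation logic testable without spending tokens; swapping `fake_llm` for a real client call is the only change needed to scale up. Exact match is a strict criterion on a five-point scale, so you may also want to report how often the two scores differ by at most one point.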

Tip: APIs beyond LLMs. Getting the data often means calling data APIs too. For practice, see the World Bank + FRED walkthrough and the FBref football walkthrough. For how APIs work under the hood, see api-advanced.


Discussion (50 min)

  • Ground truth — what is the correct answer when humans disagree? How do you validate?
  • Human vs AI — where did you and the AI diverge most? How did that feel?
  • Consistency — did the AI rate similar texts consistently? Was there a systematic gap?
  • APIs at scale — benefits and costs/risks of scaling text analysis with an API. Could someone rerun your pipeline?
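On the rerun question, one common approach (an assumption here, not something the course prescribes) is to cache every reply under a key derived from the model name, the prompt, and the text. Identical requests then return the stored reply instead of a fresh and possibly different one. The function names and the `call_llm(model, prompt, text)` signature are hypothetical.

```python
import hashlib
import json

def cache_key(model, prompt, text):
    """Stable key: the same model, prompt, and text always hash to the same value."""
    payload = json.dumps({"model": model, "prompt": prompt, "text": text}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_call(cache, model, prompt, text, call_llm):
    """Return the stored reply if this exact request was made before."""
    key = cache_key(model, prompt, text)
    if key not in cache:
        cache[key] = call_llm(model, prompt, text)
    return cache[key]

# Offline demonstration with a stub that counts how often it is called.
calls = []

def stub(model, prompt, text):
    calls.append(text)
    return '{"score": 0}'

cache = {}
cached_call(cache, "some-model", "Rate this:", "A fair result.", stub)
cached_call(cache, "some-model", "Rate this:", "A fair result.", stub)
print(len(calls))  # the second request is served from the cache
```

Saving the cache alongside the prompt file and the model name gives someone else what they need to reproduce your scored table without new API calls. Changing the prompt changes the key, so revised prompts are never silently mixed with old replies.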

Delivery

📦 What to hand in (Sunday 23:55)

  • The pipeline (script/notebook) that reads texts, calls the API, and writes a scored table.
  • The scored dataset — one row per text with score (+ reason/confidence if used).
  • The reusable prompt (prompt_classifier.md) and a short validation note: n hand-labelled, agreement rate.
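A minimal sketch of the last step, writing the scored table and the validation note, using only the standard library. The column names follow the example schema from the task (score, reason, confidence); the row and the counts are invented placeholders.

```python
import csv
import io

FIELDS = ["text_id", "score", "reason", "confidence"]

def write_scored_table(rows, fh):
    """Write one row per text to any open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

def validation_note(n_labelled, n_agree):
    """One-line summary of the hand-labelled check."""
    rate = n_agree / n_labelled
    return f"Hand-labelled: {n_labelled}. Agreement: {n_agree}/{n_labelled} ({rate:.0%})."

# Placeholder row and counts, for illustration only.
rows = [{"text_id": "t1", "score": 2, "reason": "praises finishing", "confidence": 0.9}]
buffer = io.StringIO()  # swap for open("scored.csv", "w", newline="") to write a file
write_scored_table(rows, buffer)
print(buffer.getvalue())
print(validation_note(20, 17))
```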
