Week 5 — Text as Data
Turning text into tabular data: humans vs AI, and scaling with APIs
Turning a series of short texts into tabular data — lexicons, ML, and LLMs via API — and the ground-truth problem

The second research-focused week. You turn unstructured text into numbers you can analyse, compare how you score text against how AI scores it, and build a small pipeline that scales from 5 texts to hundreds via an API. Running case: post-match football manager interviews.
Before you come to class (30–60 min)
✅ Pre-class checklist
Learning objectives
By the end of this unit you will be able to:
- Explain the classical NLP pipeline (tokenization → BoW → TF-IDF) and where LLMs fit.
- Compare the four ways to score sentiment: pre-built lexicons, custom lexicons, ML models, LLMs.
- Score text by hand and by AI, and reason about agreement and ground truth.
- Build a small text→data pipeline that calls an LLM API at scale, with a reusable prompt.
Session shape (200 min · 50·100·50)
| Block | Focus | Mode |
|---|---|---|
| Intro (50) | NLP pipeline; four approaches; humans vs AI | Talk + hand-rating |
| Task (100) | Build a text→data pipeline, classify via API | Individual / pairs |
| Discussion (50) | Validation, ground truth, reproducibility | Group |
Intro (50 min)
📖 From bag-of-words to LLMs
- Classical pipeline — tokenization, preprocessing (stemming, stop words), bag-of-words, TF-IDF.
- Why domain knowledge matters — “clinical” is a compliment in football; a generic lexicon misses that.
- Sentiment — detecting tone, not just which words appear.
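The classical pipeline above can be sketched in a few lines of plain Python. This is a minimal illustration, not the course implementation: the three statements are made up, the stop-word list is deliberately tiny, stemming is skipped, and the TF-IDF formula is one common variant among several.

```python
import math
import re
from collections import Counter

# Made-up example statements, not taken from the course dataset.
texts = [
    "We were clinical in front of goal.",
    "We were poor and the goal never came.",
    "The referee decided the match.",
]

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"we", "were", "in", "of", "the", "and"}


def tokenize(text):
    """Lowercase, keep runs of letters, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


# Bag-of-words: one Counter of token counts per text.
bows = [Counter(tokenize(t)) for t in texts]

# Document frequency: in how many texts does each token appear?
doc_freq = Counter(token for bow in bows for token in bow)


def tf_idf(bow):
    """Term frequency times log inverse document frequency."""
    total = sum(bow.values())
    n_docs = len(bows)
    return {
        tok: (count / total) * math.log(n_docs / doc_freq[tok])
        for tok, count in bow.items()
    }


weights = [tf_idf(bow) for bow in bows]
print(weights[0])
```

In the first statement, "clinical" gets a higher weight than "goal" because "goal" also appears in the second statement. TF-IDF rewards words that are distinctive for a text, but it says nothing about whether those words are positive or negative.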
Four ways to score sentiment (trade-offs matter):
| | Pre-built lexicon | Custom lexicon | ML model | LLM (API) |
|---|---|---|---|---|
| Setup | Low | Medium | High | Low |
| Domain fit | Low | High | Med–High | High |
| Context | Medium | Medium | High | Very high |
| Transparency | High | High | Low | Medium |
| Scalability | High | Medium | High | Low–Medium |
| Cost | Free | Dev time | Compute | API tokens |
Examples: VADER/TextBlob (pre-built); an AI-built football lexicon (custom); a fine-tuned BERT (ML); GPT/Claude as the scorer (LLM).
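To make the lexicon approach concrete, here is a minimal sketch of a lexicon scorer. Both lexicons and all weights are hypothetical and invented for illustration; they are not VADER, TextBlob, or the course's football lexicon. The only point is to show how a domain entry such as "clinical" changes the score.

```python
import re

# Hypothetical mini-lexicons with made-up weights, for illustration only.
GENERIC_LEXICON = {"great": 1.0, "poor": -1.0}
FOOTBALL_LEXICON = {**GENERIC_LEXICON, "clinical": 1.5, "toothless": -1.5}


def lexicon_score(text, lexicon):
    """Average the weights of the words found in the lexicon (0.0 if none match)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


statement = "We were clinical in front of goal."
generic = lexicon_score(statement, GENERIC_LEXICON)    # no matching words
football = lexicon_score(statement, FOOTBALL_LEXICON)  # "clinical" counts as praise
print(generic, football)
```

The scorer is fully transparent (you can list exactly which words drove the score), but it only sees words that are in the lexicon and ignores context such as negation.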
⚽ Hand-rate first (15 min)
Before any AI: rate the 5 manager statements yourself on the −2 … +2 scale. Then have an AI rate the same 5 — simple prompt vs detailed prompt — and compare its reasoning to yours. Look at the football-specific lexicon to see how domain knowledge changes interpretation.
Task block (100 min · individual or pairs)
Scale up from 5 texts to a real pipeline. Start small, validate, then expand.
🎯 Build a text → data pipeline
- Get the corpus. Use the combined interview dataset (text_id level) from the interviews case study. One row per text.
- Design the prompt. Decide the scale/labels, how to handle irrelevant text, and an output schema (e.g. `score`, `reason`, `confidence`). Prefer structured output over free text. Sketch in sentiment guidelines.
- Validate before scaling. Hand-label ~20 texts, run the prompt, compare. Aim for ≥ 80% agreement; if below, revise the prompt before going further.
- Classify via API. Call an LLM API over the full set. Reference: Calling LLM APIs from Python, and the case-study Python / R implementations.
- Compare methods. Put human average, domain-lexicon score, and AI average side by side. Where do they disagree most?
- Stretch. Ask the AI to predict the match result (win/draw/loss) or speaker gender from text alone — and discuss what linguistic cues it’s using and whether they’re reliable.
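The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the case-study implementation: the prompt wording, `parse_reply`, `score_texts`, `to_csv`, and the stand-in `fake_llm` are all hypothetical names. No real API is called; in your own pipeline you would pass a function that sends the prompt to your chosen LLM API and returns its text reply.

```python
import csv
import io
import json

# Hypothetical prompt; adapt the wording and labels to your own design.
PROMPT_TEMPLATE = (
    "Rate the sentiment of this post-match manager statement on a scale "
    "from -2 (very negative) to +2 (very positive). "
    'Reply with JSON only: {{"score": <int>, "reason": <str>, '
    '"confidence": <float 0-1>}}.\n\n'
    "Statement: {text}"
)


def fake_llm(prompt):
    """Stand-in for a real API call; always returns the same JSON reply."""
    return json.dumps({"score": 1, "reason": "stub reply", "confidence": 0.5})


def parse_reply(reply):
    """Parse the JSON reply and check the score is on the -2..+2 scale."""
    data = json.loads(reply)
    score = int(data["score"])
    if not -2 <= score <= 2:
        raise ValueError(f"score out of range: {score}")
    return {
        "score": score,
        "reason": str(data.get("reason", "")),
        "confidence": float(data.get("confidence", "nan")),
    }


def score_texts(rows, call_llm):
    """Score each row; keep failures as rows so nothing is silently dropped."""
    scored = []
    for row in rows:
        prompt = PROMPT_TEMPLATE.format(text=row["text"])
        try:
            result = parse_reply(call_llm(prompt))
        except (ValueError, KeyError, TypeError) as err:
            result = {"score": None, "reason": f"parse error: {err}", "confidence": None}
        scored.append({"text_id": row["text_id"], **result})
    return scored


def to_csv(scored):
    """Write the scored table as CSV text, one row per text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text_id", "score", "reason", "confidence"])
    writer.writeheader()
    writer.writerows(scored)
    return buffer.getvalue()


# Made-up rows standing in for the interview dataset.
rows = [
    {"text_id": "t1", "text": "We were clinical in front of goal."},
    {"text_id": "t2", "text": "That was nowhere near good enough."},
]
scored = score_texts(rows, fake_llm)
print(to_csv(scored))
```

Passing the API call in as a function keeps the prompt, parsing, and output format testable without spending tokens, and makes it easy to swap providers. Recording parse failures as rows, rather than skipping them, lets you see how often the model breaks the schema.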
APIs beyond LLMs. Getting the data often means data APIs too. If you want practice: World Bank + FRED walkthrough and the FBref football walkthrough. How APIs work under the hood: api-advanced.
Discussion (50 min)
- Ground truth — what is the correct answer when humans disagree? How do you validate?
- Human vs AI — where did you and the AI diverge most? How did that feel?
- Consistency — did the AI rate similar texts consistently? Was there a systematic gap?
- APIs at scale — benefits and costs/risks of scaling text analysis with an API. Could someone rerun your pipeline?
Delivery
📦 What to hand in (Sunday 23:55)
- The pipeline (script/notebook) that reads texts, calls the API, and writes a scored table.
- The scored dataset: one row per text with `score` (+ `reason`/`confidence` if used).
- The reusable prompt (`prompt_classifier.md`) and a short validation note: n hand-labelled, agreement rate.
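For the validation note, the agreement rate could be computed along these lines. This is a sketch with made-up labels: `agreement_rate` is a hypothetical helper, and whether "agreement" means an exact match or a score within one point on the −2 … +2 scale is a choice you should state in your note.

```python
def agreement_rate(human, ai, tolerance=0):
    """Share of texts where human and AI scores differ by at most `tolerance`."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two equal-length, non-empty lists of scores")
    matches = sum(abs(h - a) <= tolerance for h, a in zip(human, ai))
    return matches / len(human)


# Made-up scores on the -2..+2 scale, for illustration only.
human = [2, 1, 0, -1, -2, 1, 0, 0, -1, 2]
ai = [2, 1, 1, -1, -2, 1, 0, -1, -1, 1]

exact = agreement_rate(human, ai)                   # exact matches only
within_one = agreement_rate(human, ai, tolerance=1)  # allow a one-point gap

# Mean signed gap: negative means the AI scores lower than the human on average.
mean_gap = sum(a - h for h, a in zip(human, ai)) / len(human)

print(f"n={len(human)} exact={exact:.0%} within_one={within_one:.0%} mean_gap={mean_gap:+.2f}")
```

Reporting the signed gap next to the agreement rate helps separate random disagreement from a systematic one, where the AI is consistently harsher or more generous than you.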