Data Analysis with AI course: Scoring News with LLMs
Turning an article into a probability — what the LLM is really doing
Gábor Békés (CEU)
2026-04-21
Today’s text
“We lost control of the game for five minutes. But my feeling is that the players tried, really tried. They tried during the week and they tried today. In the past, if we had this kind of bad five minutes and we suffered two goals, we didn’t recover. Today is a different feeling.”
— Ruben Amorim, Man United manager, post-match, Nottingham Forest 2–2 Man United, 1 Nov 2025.
The question we want answered
Given just this article, what is the probability that Amorim is out as manager within the next 4 weeks?
Why it matters for the capstone
Project 2 asks you to build a per-change expectation score from news.
The LLM’s job is to read the text signal.
Form, league position, past finishes → those go into the stats model later.
Not a statistical model
No regression, no training on past sacking outcomes.
No knowledge that Man United fired two managers in 18 months.
Just generating text, one token at a time.
“0.18” is simply the string that felt most likely to follow your prompt.
So why does it work at all?
The LLM has read — during pretraining — thousands of articles that rhyme with this one:
Post-match quotes from managers under pressure
“We tried, really tried” phrasing
Bookmaker odds and next-manager speculation
Follow-up sacking articles written days or weeks later
When it generates “0.18,” it is implicitly averaging over all of that.
→ Think of it as a trained reader’s guess, expressed as a number.
Bias 1 — Compression to the middle
RLHF tuning teaches models that hedging is safe.
Without help, you rarely see below 0.05 or above 0.70.
Extremes require explicit permission (“use the full range”).
Bias 2 — Tone overweighting
The text sounds calm → the number comes out low.
Amorim ends on “different feeling” → model reads confidence.
The situation may be much tenser than the quote — but the model only sees the text.
Lesson: the LLM reads what’s on the page. Anything not on the page has to come from somewhere else.
Bias 3 — Context the model can’t see
What the text does not contain, the LLM does not know.
League position → stats model
Last-season finish → stats model
Ownership mood, board meeting, recent presser → stats model
Your job: give the LLM the text task and let the structured data live in the regression.
Strategy 1 — Crude
“Rate 0–1 the probability Amorim is sacked in 4 weeks.”
You get a number.
Directionally OK.
Noisy across articles, poorly calibrated.
Fine as a starting point. Bad as a final answer.
Strategy 2 — Anchored
Give the model a reference frame.
“Some Premier League articles signal a manager is on the brink (close to 1). Others suggest stability (close to 0). Rate this article on that scale.”
The model now has a peg.
Scores become more consistent across articles.
Still noisy within an article.
Strategy 3 — Structured output
Structured output = you tell the model to fill in a predefined template, not write free prose.
Why bother?
Machine-readable: no regex parsing.
Forces the model to address each field — skipping is not an option.
Auditable: you can show the evidence the model used.
Most APIs support this via JSON mode or tool-use schemas.
What is JSON?
JSON = JavaScript Object Notation. A universal format for structured data.
Every field has a name. Every value has a type. That’s the whole idea.
Structured JSON for scoring
Ask the LLM for exactly this shape:
```json
{
  "p_sack_4w": 0.18,
  "evidence_for": [
    "we lost control of the game for five minutes",
    "a year to the day since he was appointed"
  ],
  "evidence_against": [
    "we came from three good games",
    "different feeling"
  ],
  "is_relevant": true,
  "confidence": "medium"
}
```
Two wins: the evidence fields force reading; the is_relevant flag filters noise downstream.
Strategy 4 — Reason then score
Two-step prompt:
“Give me a one-paragraph analysis of the article.”
“Now, based on your analysis, state the number.”
A simple form of chain-of-thought prompting.
Typically worth a 5–15% quality improvement on scoring tasks like this.
More tokens, slightly more cost.
Strategy 5 — Self-consistency
Call the same prompt 5–10 times at temperature 0.7. Take the mean.
Robust to single-sample noise.
Variance across samples is itself a signal — low variance = confident read, high variance = genuinely ambiguous.
In the capstone: this is free information. Use it.
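Aggregating the samples takes two lines, and the second line is the free information. A minimal sketch, assuming you already have the 5–10 sampled scores in a list:

```python
from statistics import mean, pstdev

def self_consistency(scores: list[float]) -> tuple[float, float]:
    """Aggregate repeated samples of the same prompt.

    Returns (mean, spread): the mean is your score, the spread is the
    confidence signal -- low spread = confident read, high spread = ambiguous.
    """
    return mean(scores), pstdev(scores)
```

Carry both numbers into the pipeline: the spread can become a weight in the weekly aggregation, or a flag for articles worth hand-checking.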
Strategy 6 — Logprob-based (advanced)
Idea: don’t ask the model to state a probability — read the probability directly from the token distribution it’s already computing.
Constrain the answer to one word: {LOW, MEDIUM, HIGH}.
Read the probability of each candidate from the API.
Skips the model’s “stated probability” hedging.
Better calibrated. A bit fiddlier.
Worth it for the bonus calibration task — see appendix note.
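Turning the three token probabilities into one number means picking a representative value per label and taking the expectation. A sketch, assuming an API that exposes top-token logprobs (OpenAI-style `top_logprobs`; the Anthropic Messages API does not expose them) and assuming the anchor values below, which are a design choice, not something the API gives you:

```python
import math

# Representative probability per label -- an assumption you should tune
LABEL_VALUE = {"LOW": 0.05, "MEDIUM": 0.3, "HIGH": 0.7}

def expected_score(logprobs: dict[str, float]) -> float:
    """Turn token logprobs for LOW/MEDIUM/HIGH into one expected probability.

    `logprobs` maps each candidate label to its log-probability as read
    from the API; we renormalize over just these three candidates.
    """
    weights = {label: math.exp(lp) for label, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(LABEL_VALUE[label] * w / total for label, w in weights.items())
```

Because the answer is read from the distribution rather than stated in text, the hedging bias from Bias 1 never gets a chance to compress it.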
In-class activity — pick a strategy
We’ve just seen six ways to turn an article into a score:
Crude — just ask
Anchored — give a reference frame
Structured output (JSON)
Reason then score
Self-consistency — sample and average
Logprob-based — read the token distribution
In small groups, pick one strategy each and argue for it:
What does it do well?
Where does it fail?
Would you use it alone, or combined with another?
(We’ll compare across groups and assemble a default recipe together.)
In-class activity — read the article
Look at the full Reuters text.
Write down your own number: P(Amorim out in 4 weeks) on a 0–1 scale.
Then write one phrase from the article that pushed you up, and one that pushed you down.
(We’ll compare notes — and then we’ll ask Claude.)
Cues the LLM probably latched onto
| Direction | Text cue |
|:---------:|----------|
| ↑ | “we lost control of the game for five minutes” |
| ↑ | “a year to the day since he was appointed” |
| ↑ | “two quickfire goals” |
| ↓ | “four successive Premier League victories” |
| ↓ | “stunning volley rescued United a deserved point” |
| ↓ | “different feeling”, “good confidence” |
| ≈ | “we could not win this game, but we are not going to lose” |
Where does Claude land?
Typical range: 0.10–0.20 for 4-week probability on this article alone.
With strong anchoring to the quotes: can drift to 0.05–0.10.
The up-note ending (“different feeling”) pulls the number down.
Discussion: did you land in the same range? If not — which cue weighed more for you than for the model?
In-class activity — turn to AI and we’ll discuss
Pick three of these. Send them to your AI assistant. Bring back a proposal.
How should we handle articles that mention more than one team?
Should we filter by keyword before calling the LLM, or let the LLM filter?
How do we aggregate daily news into a weekly score?
If articles disagree sharply, do we trust the mean or the max?
Should is_relevant = false articles count as zero or drop from the mean?
How long before a change should we score — 4 weeks, 2 weeks, or 1 week?
Why this matters
There is no textbook answer for any of these.
Every design choice is a degree of freedom.
Your AI is a fast sparring partner — it won’t give you the answer, but it’ll surface the trade-offs.
The defensible answer comes from you + data + one validation pass.
Bonus task (3 points) — investigate calibration
Claim to test: does a score of 0.20 from the LLM really mean “roughly 20% of articles like this precede a sacking”?
Recipe:
Pick 30 manager-change events + 30 control weeks (same teams, no change).
For each, score 5 random articles from the preceding 2 weeks.
Plot: predicted score (x) vs. realized outcome (y) — use a calibration curve.
Compute the Brier score.
Report: where is the model well-calibrated? Where does it miss?
Extension (for full 3p): re-do step 2 using logprob-based scoring and compare.
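Steps 3–4 of the recipe reduce to a few lines once you have paired lists of predicted scores and realized 0/1 outcomes. A minimal sketch (function names are ours; `n_bins` is a design choice you may want to vary with only 60 events):

```python
def brier_score(predicted: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(predicted, outcomes)) / len(predicted)

def calibration_bins(predicted: list[float], outcomes: list[int],
                     n_bins: int = 5) -> list[tuple[float, float]]:
    """Points of a calibration curve: (mean predicted, realized rate) per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, y))
    return [(sum(p for p, _ in b) / len(b), sum(y for _, y in b) / len(b))
            for b in bins if b]
```

A well-calibrated scorer puts the bin points near the 45° line; a Brier score near 0 is good, and 0.25 is what constant 0.5 guessing earns.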
Open issues — stuff to decide
Temperature: 0 (reproducible) or 0.7 with self-consistency?
Hand-validate: yes or no? If yes, how many articles?
Output format: raw number, JSON, or Likert scale?
Model: Sonnet (quality) or Haiku (cost) for hundreds of articles?
Relevance filter: separate LLM call, or one field inside the main call?
Aggregation: mean, max, or weighted-by-source across articles?
Time window: score 4 weeks, 12 weeks, or both?
Calibrate: Platt-scale the raw scores, or report percentiles, or report raw?
How to close these
Group decision in the next lab: pick a default for each.
Document the choice in your prompt_classifier.md.
Validate on a hand-labelled sample of 30–50 articles.
Iterate: if agreement is poor, it’s almost always the prompt, not the model.
Key takeaways
The LLM is a trained reader, not a statistical model.
It reads text cues. League position and form go into your regression, not the prompt.
Structured JSON is the difference between prototype and pipeline.
Self-consistency + logprobs are your tools when calibration matters.
Every scoring choice is a design decision — own it, document it, validate it.