Week 05: Advanced CLI Workflows
Power features for reproducible research with Claude Code and Gemini CLI

About the class
Last week we introduced agentic AI with Claude Code – running AI in the terminal that can see files, run code, and iterate. This week we go deeper: project-specific instructions, custom skills, git integration, unit testing for data workflows, and autonomous execution. These features turn CLI tools from clever assistants into reproducible research companions.
Tool Options
We use Claude Code in this class, but the workflow transfers:
| Tool | Interface | Notes |
|---|---|---|
| Claude Code | Terminal CLI agent | Default in this course. |
| Gemini CLI | Terminal CLI agent | Similar command-line workflow. |
| Codex CLI | Terminal CLI agent | OpenAI alternative for terminal workflows. |
The focus is on Claude Code, with Gemini CLI and Codex equivalents where relevant. We will use one continuous case study (Austrian Hotels) across all activities so each step builds on the last. Everything you learn today transfers to alternatives, so prioritize process over tool-specific details.
Learning Objectives
By the end of the session, students will:
- Create and apply a project instruction file (`CLAUDE.md`/`GEMINI.md`/`AGENTS.md`) in a real analysis folder
- Build and run one reusable skill that automates a multi-step data workflow
- Use git-integrated prompting to create a branch, implement a change, and explain the resulting diff
- Explain what a unit test is and implement a few simple tests for a data pipeline
- Evaluate when autonomous execution is appropriate and apply a safety checklist before and after execution
Before class
🔧 Prerequisites
Required:
- Working Claude Code installation from Week 04
- Basic familiarity with running Claude Code in a project directory
- A project folder with at least one data file and one script
Recommended Reading:
- Advanced CLI Workflows reference page – the full reference guide for today’s topics
- Installing CLI Tools – if you need to catch up
Class Plan
Part 1: Project Instruction Files (20 min)
The idea: Instead of repeating preferences in every prompt, write them once in a project instruction file the CLI tool reads automatically.
Equivalent files by tool:
- Claude Code: `CLAUDE.md`
- Gemini CLI: `GEMINI.md`
- Codex: `AGENTS.md`
Example for the Austrian Hotels case study:
```markdown
## Code Style
- Use tidyverse syntax for R code
- Use pandas for Python data manipulation
- Prefer ggplot2 with the viridis color scheme

## Data Standards
- All dates in ISO 8601 format (YYYY-MM-DD)
- Column names: lowercase with underscores

## Analysis Preferences
- Always check for missing values before analysis
- Report sample sizes in all tables
```
Hierarchical loading concept: Global defaults → Project file → Subdirectory file. More specific settings override general ones (exact paths are tool-specific).
Hands-on: Create this file for the Austrian Hotels project in your tool of choice (CLAUDE.md, GEMINI.md, or AGENTS.md).
Part 2: Custom Skills and Reusable Workflows (25 min)
What are skills?
Skills are reusable instruction sets that Claude Code can load automatically or on demand. Each skill lives in a .claude/skills/<skill-name>/SKILL.md file.
Scope:
- Personal: `~/.claude/skills/<skill-name>/SKILL.md` (available in all your projects)
- Project: `.claude/skills/<skill-name>/SKILL.md` (this repo only)
Hands-on: Continue the Austrian Hotels case and create a /clean-hotels-data skill
Create .claude/skills/clean-hotels-data/SKILL.md in your project:
```markdown
---
name: clean-hotels-data
description: Clean and validate Austrian Hotels data for analysis.
---
Run the Austrian Hotels cleaning pipeline:
1. Check missing values in `data/hotels_raw.csv`
2. Run `scripts/data_cleaning.R`
3. Verify output dimensions in `data/hotels_clean.csv`
4. Generate a short data quality report
```
Now typing /clean-hotels-data in Claude Code executes this workflow consistently.
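Step 1 of the skill (the missing-value check) can be sketched in pandas. The column names and the inline stand-in data below are assumptions, not the real `hotels_raw.csv`:

```python
# Sketch of the skill's first step: a missing-value report in pandas.
# Column names are assumptions about the hypothetical hotels_raw.csv.
import pandas as pd

def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    return report.sort_values("n_missing", ascending=False)

# Tiny inline stand-in for data/hotels_raw.csv
raw = pd.DataFrame({
    "hotel_id": ["h1", "h2", None, "h4"],
    "occupancy_rate": [72.5, None, 61.0, 88.2],
    "date": ["2024-06-01", "2024-06-01", "2024-06-02", None],
})
print(missing_value_report(raw))
```

Printing the report at the start of the pipeline gives you a concrete artifact to compare against after each cleaning step.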
Discussion: What repetitive analysis tasks could you turn into skills?
Gemini CLI equivalent: Native Agent Skills (.gemini/skills/<skill-name>/SKILL.md) for reusable workflows, plus MCP servers for external tool integration.
Part 3: Git Integration (20 min)
Why it matters: Version control + AI = traceable, reproducible analysis. Claude Code understands git natively. Continue with the same Austrian Hotels project so the instruction file and skill are now part of version history.
Branch management:
Create a new branch `hotels-robustness-checks`,
run the robustness checks on the Austrian Hotels model,
and prepare a summary of changes
Claude Code can create the branch, make changes, commit with descriptive messages, and provide a diff summary.
Understanding project history:
Look at the git history for `scripts/data_cleaning.R`.
Why did we change the outlier detection threshold in the Hotels pipeline?
Claude Code reads commit messages and diffs to understand past decisions – useful when revisiting old projects.
Hands-on: Use Claude Code to branch, make one Hotels analysis change, commit it, and explain the diff to a partner.
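The branch-change-commit-diff cycle Claude Code performs looks roughly like this by hand. The sketch below runs in a throwaway repo with a hypothetical placeholder file and identity, so it is safe to try anywhere:

```shell
# Sketch of the branch/commit/diff cycle in a throwaway repo.
# The file, commit message, and identity are hypothetical placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "student@example.com"
git config user.name "Student"
git commit --allow-empty -q -m "Initial commit"
git checkout -q -b hotels-robustness-checks    # the branch from the prompt above
echo '# robustness checks placeholder' > robustness_checks.R
git add robustness_checks.R
git commit -q -m "Add robustness checks for the Hotels model"
git diff --stat HEAD~1 HEAD                    # the diff summary you would explain
```

Seeing the manual equivalent makes it easier to audit what the agent actually did when it reports "branch created, changes committed".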
Part 4: Unit Tests for Data Pipelines (15 min)
What is a unit test?
A unit test is a small, focused check that verifies one expected behavior.
For data work, think of each test as a guardrail: “this column should never be negative”, or “this join should not duplicate keys.”
How to think about tests in data analysis:
- Turn assumptions into explicit assertions. Example: "every cleaned row has `hotel_id` and `date`" -> `not_null` tests.
- Cover core data-quality dimensions: schema/type (`date` parses), completeness (`city_id` not null), validity (`occupancy_rate` 0-100), relationships (`city_id` exists in the city table), volume (row count within range).
- Prefer tests that return failing rows, not just pass/fail. Example: if 12 rows fail the occupancy range test, inspect those exact 12 rows first.
- Mix generic and business-rule tests at each stage (raw -> clean -> analysis). Example: generic `unique(hotel_id, date)` plus the project rule "no occupancy spikes >30 points day-to-day without a note."
Typical test cases for Austrian Hotels:
- `hotel_id` is unique in the cleaned table
- Key fields are not null (`hotel_id`, `city_id`, `date`)
- `city_id` values in hotels match ids in the city lookup table
- `occupancy_rate` is between 0 and 100
- Row count stays within an expected range after cleaning/joining
Hands-on: Ask the CLI to generate and run 5 tests for the Hotels cleaning pipeline. Review one failing test together, fix the issue, and rerun.
Part 5: Autonomous Execution (15 min)
The concept: Both tools can run non-interactively, but autonomy depends on approval/permission mode – useful for trusted, repetitive workflows. Continue with the same Hotels workflow so students can compare manual and autonomous execution on familiar tasks.
Safety protocol (always):
- Before run: confirm you are on the correct branch and scope is limited
- Before run: write down a success criterion (what output must exist or change)
- After run: inspect the diff (`git diff`) and explain every changed file
- After run: run validation checks/tests and spot-check key output tables
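The "write down a success criterion" step can itself be encoded as a check you run after the agent finishes. A minimal sketch, assuming a hypothetical output path and row threshold:

```python
# Sketch: a success criterion as code, run after an autonomous execution.
# The output path and row threshold are hypothetical; adapt to your pipeline.
from pathlib import Path

def check_success(output: Path, min_rows: int = 100) -> bool:
    """True if the expected output exists and has at least min_rows data rows."""
    if not output.exists():
        return False
    n_rows = sum(1 for _ in output.open()) - 1  # subtract the header line
    return n_rows >= min_rows
```

Running `check_success(Path("data/hotels_clean.csv"))` after an autonomous run turns "did it work?" into a yes/no answer you committed to before execution.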
Gemini CLI with YOLO approval mode (`--approval-mode=yolo`):

```shell
gemini -p "For Austrian Hotels, install missing packages, run /clean-hotels-data, and execute robustness checks" --approval-mode=yolo
```

Claude Code headless mode (`-p` is non-interactive; set permission mode explicitly when needed):

```shell
claude -p "For Austrian Hotels, run the cleaning skill and robustness checks, then summarize changed files"
```

When to use autonomous execution:
- Trusted, repetitive pipelines you’ve run before
- Well-defined tasks with clear success criteria
When NOT to:
- New or unfamiliar code
- Tasks that modify shared resources
- Anything you haven’t verified manually first
Discussion: What are the risks of autonomous execution in a research context? How do you balance speed with verification?
Discussion Questions
- How could custom skills improve reproducibility in your research?
- What instructions would you put in your `CLAUDE.md`/`GEMINI.md`/`AGENTS.md` for your thesis/project?
- Which two unit tests would you add first to your own data pipeline, and why?
- When is autonomous execution appropriate vs. risky in data analysis?
Assignment
Due: Before Week 6
Use Claude Code to solve an interesting problem of your choice and report on the experience: what you asked, what worked, what failed, and how you verified the results.