Week 05: using text as data
Turning a series of short texts into tabular data: humans vs AI
Overview
In this lesson, students will be introduced to sentiment analysis, specifically applied to evaluating general positivity or negativity in football managers’ statements about match outcomes.
Learning Outcomes
By the end of the session, students will:
- Understand core NLP concepts: tokenization, preprocessing, bag-of-words, and feature extraction in practical context.
- Gain hands-on experience with sentiment analysis.
- Understand the complexities and limitations of sentiment analysis.
Preparation / Before Class
📚 Background Reading
No specific readings required - we’ll learn by doing
- No need for natural language processing background.
- Thinking ahead: How would you manually determine if someone sounds “positive” or “negative” about something?
Review
📈 Assignment Review (15 min)
Key Learning Points:
- Fancy graphs != good graphs (good graph ← careful design)
- Precise interpretation >> BS
- Less is more
- Show only what you understand deeply
Less is More Principle: Simplicity often reveals insights better than complexity
Class Material
📊 Materials Setup
Data Context:
- These are real post-match interviews from Premier League managers discussing game results.
- Each text represents one manager’s full response to journalists after a match.
- Preprocessing: Player and team names were replaced with “PLAYER” and “TEAM” to remove a potential source of bias.
Practice Dataset: Contains 5 post-match interviews from football managers
Reference Materials:
- General sentiment rating scale - our classification framework
- Domain-specific lexicon - football-specific sentiment terms
📖 NLP Fundamentals Lecture (25 min)
Key Concepts Covered:
- Tokenization: Breaking text into meaningful units
- Preprocessing: Stemming, lemmatization, stop word removal
- Bag of Words: Converting text to numerical features
- Domain knowledge: Why football-specific terms matter
- Sentiment Analysis: Methods from simple dictionaries to AI models
Practical Focus: How these concepts apply to manager interviews
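To make the first three concepts concrete, here is a minimal sketch in plain Python. The stop-word list and the two example sentences are invented for illustration and are not taken from the class dataset. Stemming and lemmatization are left out, since they would normally come from a library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "and", "we", "to", "of", "in", "was", "it", "but"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def preprocess(tokens):
    """Drop stop words; stemming and lemmatization are omitted here."""
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(texts):
    """Return a sorted vocabulary and one count vector per text."""
    counts = [Counter(preprocess(tokenize(t))) for t in texts]
    vocab = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocab] for c in counts]
    return vocab, vectors

# Made-up sentences in the style of the anonymized interviews.
texts = [
    "We dominated the game and PLAYER was brilliant.",
    "It was a poor game, but TEAM deserved the win.",
]
vocab, vectors = bag_of_words(texts)
print(vocab)
print(vectors)
```

Each text becomes one row of counts over a shared vocabulary, which is the step that turns free text into tabular data. Word order is lost along the way, which is one reason simple bag-of-words models struggle with negation such as "not good".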
⚽ Manual vs. AI Sentiment Workshop (40 min)
Phase 1: Human Rating (15 min)
- Individual work: Rate the 5 provided manager statements using the sentiment scale
- 5-point scale: -2 (very negative) to +2 (very positive)
Phase 2: AI Rating (15 min)
- Use your preferred AI to rate the same 5 statements
- Try different prompting approaches: simple vs. detailed instructions
- Compare AI’s reasoning with your own thought process
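One way to keep the simple and detailed prompts comparable is to generate both from the same statement. The sketch below only builds the prompt strings and does not call any AI service. The wording of the prompts and the labels for -1, 0, and +1 are assumptions; only the end points of the scale come from the class material.

```python
# Scale labels: the end points follow the class scale, the middle three are assumed.
SCALE = {
    -2: "very negative",
    -1: "negative",
    0: "neutral",
    1: "positive",
    2: "very positive",
}

def simple_prompt(statement):
    """A one-line instruction with no context."""
    return f"Rate the sentiment of this statement from -2 to +2:\n{statement}"

def detailed_prompt(statement):
    """The same task with context, an explicit scale, and a request for reasoning."""
    scale_lines = "\n".join(f"{score:+d}: {label}" for score, label in SCALE.items())
    return (
        "You are rating a football manager's post-match interview.\n"
        "Player and team names have been replaced with PLAYER and TEAM.\n"
        "Use this scale:\n"
        f"{scale_lines}\n"
        "Give one integer rating and a one-sentence explanation.\n\n"
        f"Statement: {statement}"
    )

# Made-up statement for illustration.
statement = "TEAM fought hard, but we did not deserve anything today."
print(simple_prompt(statement))
print(detailed_prompt(statement))
```

Paste each prompt into your preferred AI and record both ratings next to your own, so that any difference can be traced to the prompt instead of the text.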
Phase 3: Domain Lexicon Exploration (10 min)
- Examine the football-specific lexicon
- Discuss how domain knowledge affects interpretation
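A dictionary-based scorer shows why the domain lexicon matters. The sketch below uses two hypothetical mini-lexicons invented for illustration; the terms and weights are assumptions and do not reproduce the lexicon handed out in class.

```python
import re

# Hypothetical mini-lexicons; terms and weights are illustrative assumptions.
GENERAL_LEXICON = {"brilliant": 1, "good": 1, "poor": -1, "disappointed": -1}
FOOTBALL_LEXICON = {"clean sheet": 1, "dominated": 1, "relegation": -1, "red card": -1}

def lexicon_score(text, lexicon):
    """Sum the weights of every lexicon term found in the text."""
    lowered = text.lower()
    score = 0
    for term, weight in lexicon.items():
        # Word boundaries avoid matching a term inside a longer word.
        hits = re.findall(r"\b" + re.escape(term) + r"\b", lowered)
        score += weight * len(hits)
    return score

# Made-up statement for illustration.
statement = "We kept a clean sheet and dominated the second half."
general = lexicon_score(statement, GENERAL_LEXICON)
combined = lexicon_score(statement, {**GENERAL_LEXICON, **FOOTBALL_LEXICON})
print(general, combined)
```

With the general lexicon alone the statement scores 0 because none of its words are listed. Adding the football terms gives it a score of 2, since "clean sheet" and "dominated" carry sentiment only for someone who knows the sport.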
Discussion: Validation and Sentiment Analysis
- Objective: Discuss validation techniques used in sentiment analysis.
- Topics for discussion:
- Differences between manual and AI ratings
- Ground Truth
- Introduction to validation methods:
- If there is ground truth: build a confusion matrix and calculate accuracy.
- If there is no ground truth: measure agreement between human and AI ratings, and test whether they differ.
- AI is average, but…
- AI with persona?
- Is AI biased?
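The two validation routes can be sketched in plain Python. The ratings below are placeholder values invented for illustration, not results from the class. Cohen's kappa is used here as one possible agreement measure. It treats the five scale points as unordered categories, so a weighted variant may suit an ordered scale better.

```python
from collections import Counter

LABELS = [-2, -1, 0, 1, 2]

def confusion_matrix(truth, predicted, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(truth, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(truth, predicted):
    """Share of texts where the two ratings match exactly."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def cohen_kappa(rater_a, rater_b):
    """Agreement corrected for chance; undefined if chance agreement is 1."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Placeholder ratings for five texts (illustrative only).
human = [2, -1, 0, 1, -2]
ai = [1, -1, 0, 1, -1]

for row in confusion_matrix(human, ai):
    print(row)
print("accuracy:", accuracy(human, ai))
print("kappa:", cohen_kappa(human, ai))
```

If the human ratings are accepted as ground truth, the confusion matrix and accuracy describe how often the AI matches them. If neither side is treated as the truth, the kappa value describes how much the two raters agree beyond what chance alone would produce.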
Extra: Prediction of score
If time permits: look at predicting score
- Modeling choices for match results
- Think about how you would do it first
- Check how the AI approaches it: have it rate the examples and look at its explanations
- Take the 5 examples and compare your predictions with the AI's predictions
End of Week Discussion points
- How precise is AI in sentiment analysis?
- How did you compare to AI in terms of scores? How did any difference make you feel?
- Can you think of a past project where AI could have helped you upgrade it?
- Did AI provide consistent explanations for its ratings? When did its reasoning seem flawed? Did you feel it was mostly OK / half-truths?
Assignment
Due: Before Week 6
Compare manual ratings with AI-generated ratings on your own random sample of 25 interview texts.
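A reproducible way to draw your sample is to fix a random seed, so that you and the AI rate exactly the same 25 texts on every run. The sketch below assumes the interviews are already loaded into a Python list. The placeholder list, the seed value, and the function name `draw_sample` are all hypothetical.

```python
import random

def draw_sample(texts, k=25, seed=42):
    """Draw k distinct texts; the same seed always returns the same sample."""
    rng = random.Random(seed)
    return rng.sample(texts, k)

# Placeholder standing in for the full list of interview texts.
all_texts = [f"interview_{i}" for i in range(200)]

sample = draw_sample(all_texts)
print(len(sample))
```

Change the seed to something personal, such as a student ID, so that each student ends up with a different sample.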
Some personal comments on AI and this class
- It’s worth noting that my knowledge of NLP was limited to bag of words, bigrams and TF-IDF. AI really helped me make the jump.