Data Analysis with AI course

05 using text as data

Gábor Békés (CEU)

2025-03-24

Motivation

Why Text Analysis?

  • Unlocks unstructured data: Reports, news, social media, interviews
  • Quantifies qualitative information: Sentiment, topics, uncertainty
  • Supplements traditional economic data
  • Real-world applications: Central bank communication, market sentiment, policy analysis

Example: Economic Applications

  • Analyzing Federal Reserve statements to predict market reactions
  • Measuring Economic Policy Uncertainty through news coverage
  • Quantifying sentiment in earnings calls

What Is NLP?

  • Definition: Techniques that help computers understand, interpret, and generate human language.
  • Applications:
    • Sentiment (positive/negative)
    • Topic classification
    • Summarization
  • Examples: analyzing newspaper articles for policy stances or corporate sentiment.
  • NLP is the foundation for many AI text applications (chatbots, search engines, etc.).

Today’s Case Study

“In the second half, especially after we scored for 1-1, I thought we were really impressive. We created so many opportunities, good chances. Then [the result] feels like a disappointment.”

We’ll analyze a post-match interview with a Premier League manager to:

  1. Predict the match result
  2. Analyze sentiment patterns
  3. Determine if the manager attributes the outcome to luck

Sentiment Analysis

Detecting emotion and tone in text

Positive segments:

> “In the second half, especially after we scored for 1-1, I thought we were really impressive.”
> “PLAYER’s finishing is so clinical.”

Negative segments:

> “Then [the result] feels like a disappointment.”
> “We tried to cope with it, but every time we touched them we got a yellow…”

Mixed segments:

> “In the first half we had a lot of problems with their intensity, aggressive playing style…”

Core Concepts in Text Analysis

Plan for this class

  • Different ways to analyze text
  • Turning text into tokens
  • Using AI to analyze the text directly

We’ll cover the following concepts:

  • Tokenization
  • Preprocessing: Stemming vs. Lemmatization
  • Feature Extraction: Bag of Words in Depth

And discuss:

  • What these tools give us
  • Limitations

Bag of words

Corpus: the full text material. Here it is interviews, but it could be a book or a collection of articles.

Bag of words: Text representation that counts word occurrences, ignoring grammar and word order

  • Words → tokens
  • Tweaks (n-grams, weighting)

Tokenization

💡 Tokenization is breaking text into meaningful units (tokens)

Original text from our interview:

> “In the second half, we were really impressive. We created many opportunities.”

Tokenized:

["In", "the", "second", "half", ",", "we", "were", "really", "impressive", ".", "We", "created", "many", "opportunities", "."]

Why it matters: Foundation for all text analysis - determines what counts as a “word”
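
A minimal sketch using NLTK’s word_tokenize (one of several tokenizers; the tokenizer model must be downloaded once):

import nltk
nltk.download("punkt", quiet=True)  # one-time download (newer NLTK versions use "punkt_tab")
from nltk.tokenize import word_tokenize

text = "In the second half, we were really impressive. We created many opportunities."
print(word_tokenize(text))
# ['In', 'the', 'second', 'half', ',', 'we', 'were', 'really', 'impressive', '.', ...]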

Preprocessing: Stemming vs. Lemmatization

💡 Stemming: Algorithmically removes word endings - Fast but sometimes creates non-words

Examples from our interview

  • “playing” → “play”
  • “played” → “play”
  • “impressive” → “impress”
  • “disappointment” → “disappoint”

Purpose: Unifies different forms of the same word to avoid treating them as separate concepts
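
A minimal sketch with NLTK’s PorterStemmer (other stemmers produce slightly different stems):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "played", "impressive", "disappointment"]:
    print(word, "->", stemmer.stem(word))
# playing -> play, played -> play, impressive -> impress, disappointment -> disappoint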

Preprocessing: Stemming vs. Lemmatization

Lemmatization: Converts to dictionary base form - More accurate but computationally intensive

Examples from our interview

  • “were” → “be”
  • “came” → “come”
  • “dominated” → “dominate”
  • “better” → “good”

Purpose: Unifies different variations of the same word to avoid treating them as separate concepts
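
A minimal sketch with NLTK’s WordNetLemmatizer; note that the part of speech must be supplied to get the dictionary base form:

import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("were", pos="v"))       # 'be'
print(lemmatizer.lemmatize("dominated", pos="v"))  # 'dominate'
print(lemmatizer.lemmatize("better", pos="a"))     # 'good'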

Why Preprocessing Matters

Let’s look at our manager’s interview:

Original excerpt:

> “We tried to cope with it, but every time we touched them we got a yellow and that doesn’t really help for us to be intense then as well.”

After preprocessing:

["try", "cope", "touch", "get", "yellow", "help", "intense", "good"]

Benefits:

  • Reduces vocabulary size (computational efficiency)
  • Focuses on meaningful content
  • Improves pattern detection
  • Enables comparisons across texts
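
Putting the pieces together, a minimal preprocessing sketch (lowercase, drop punctuation and stopwords, lemmatize verbs); the exact token list depends on the stopword list and lemmatizer used, so it will differ slightly from the list above:

import nltk
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = ("We tried to cope with it, but every time we touched them we got "
        "a yellow and that doesn't really help for us to be intense then as well.")
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]  # drop punctuation
clean = [lemmatizer.lemmatize(t, pos="v") for t in tokens if t not in stop]
print(clean)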

Feature Extraction: Bag of Words

Bag of Words Text representation that counts word occurrences, ignoring grammar and word order

Example from our interview:

> “In the first half we had a lot of problems with their intensity, aggressive playing style without the ball – aggressive in a good way.”

Bag of Words representation:

{"first": 1, "half": 1, "lot": 1, "problems": 1, "intensity": 1, "aggressive": 2, "playing": 1, "style": 1, "ball": 1, "good": 1, "way": 1}

What we can learn:

  • Word frequencies (most common terms)
  • Distinctive vocabulary
  • Comparison between documents
  • Simple sentiment indicators
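
A minimal bag-of-words sketch with scikit-learn’s CountVectorizer (assuming scikit-learn is available; English stopwords are dropped, so the counts roughly match the representation above):

from sklearn.feature_extraction.text import CountVectorizer

doc = ("In the first half we had a lot of problems with their intensity, "
       "aggressive playing style without the ball - aggressive in a good way.")
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform([doc])
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0])))
# e.g. {'aggressive': 2, 'ball': 1, 'good': 1, ...}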

Bag of Words: Tweaks

Limitations of Bag of Words

What’s missing from our coach’s interview:

  1. Word order: “We dominated them” vs. “They dominated us”
  2. Context: “not impressive” counts “impressive” as positive
  3. Phrases: “Premier League” becomes “premier” and “league”
  4. Relationships: Between players, actions, and outcomes

Extensions to address limitations:

  • N-grams capture local word order
  • TF-IDF weights words by importance across documents
  • Word embeddings capture semantic relationships

Practical Example: Word Frequency Analysis

Let’s analyze the coach’s interview:

Top 10 most frequent content words:

  1. PLAYER (10 mentions)
  2. half (4 mentions)
  3. played (3 mentions)
  4. well (3 mentions)
  5. intense/intensity (3 mentions)
  6. good (3 mentions)
  7. special (3 mentions)
  8. better (2 mentions)
  9. difficult (2 mentions)
  10. impressive (2 mentions)

What this tells us:

  • Focus on individual player performance
  • Comparison between first and second half
  • Emphasis on quality of play and intensity

N-grams: Adding Context

💡 Bigrams (2-word phrases) from our interview:

  • “good way” (1)
  • “playing style” (1)
  • “special player” (1)

Trigrams (3-word phrases):

  • “played much better” (1)
  • “massive impact game” (1)

Benefits:

  • Captures phrases and contextual relationships
  • Preserves some word order information
  • Identifies common expressions and technical terms
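
A minimal sketch: CountVectorizer’s ngram_range parameter extracts bigrams (or trigrams with (3, 3)); the sentence here is an illustrative fragment:

from sklearn.feature_extraction.text import CountVectorizer

doc = "aggressive playing style in a good way"  # illustrative fragment
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit([doc])
print(bigrams.get_feature_names_out())
# e.g. ['aggressive playing' 'good way' 'in good' 'playing style' 'style in']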

TF-IDF: Beyond Simple Counts

💡 Term Frequency-Inverse Document Frequency (TF-IDF) Weights words by importance in a document compared to a collection

Example: Comparing our interview with other post-match interviews

Common words in all interviews (low TF-IDF):

  • “game”, “played”, “team”, “half”

Distinctive words in this interview (high TF-IDF):

  • “intensity”, “aggressive”, “clinical”, “impressive”

Benefits:

  • Reveals what makes this text unique
  • Reduces importance of common terms
  • Identifies signature vocabulary and themes
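
A minimal sketch with scikit-learn’s TfidfVectorizer on a few invented interview snippets; the higher weights flag the words distinctive to one document:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

interviews = [                       # hypothetical mini-corpus
    "We played well in the first half of the game.",
    "The team played with intensity and aggressive, clinical pressing.",
    "A tough game, but the team played well in the second half.",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(interviews).toarray()
terms = vectorizer.get_feature_names_out()
top = np.argsort(X[1])[::-1][:4]     # most distinctive terms of interview 2
print([(terms[i], round(X[1][i], 2)) for i in top])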

What we have

  • Bag of words gives us extracted words (tokens)
  • Use 2-grams to capture expressions (“yellow card”)
  • Select important words (e.g., via TF-IDF)

What we need

  • Link words to sentiments via a dictionary
  • Consider context

Sentiment Analysis

Detecting emotion and tone in text

Sentiment

Positive segments:

> “In the second half, especially after we scored for 1-1, I thought we were really impressive.”
> “PLAYER’s finishing is so clinical.”

Negative segments:

> “Then [the result] feels like a disappointment.”
> “We tried to cope with it, but every time we touched them we got a yellow…”

Mixed segments:

> “In the first half we had a lot of problems with their intensity, aggressive playing style…”

Sentiment Analysis Libraries and Approaches

From Bag of Words to Sentiment Analysis

  • BoW tells us what words appear in text
  • Sentiment analysis tells us how those words convey emotion or opinion
  • Connection: Use word frequencies/presence to determine sentiment
  • Challenge: Words have different meanings in different contexts

Approaches to Sentiment Analysis

  1. Pre-built Lexicon Libraries
  2. Domain-Specific Custom Lexicons
  3. Machine Learning Models
  4. AI-Assisted Analysis

Approach 1: Pre-built Lexicon Libraries

Popular libraries: NLTK’s VADER, TextBlob, AFINN, SentiWordNet

Examples


const BASIC_LEXICON = {
  // Positive words
  positive: [
    'good', 'great', 'excellent', 'happy', 'pleased',
    // ... more positive terms elided ...
    'confidence', 'proud', 'amazing', 'fantastic'
  ],
  // Negative words would be listed the same way
  negative: [ /* ... */ ]
};

The code below computes a sentiment score with NLTK’s VADER:

import nltk
nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sia.polarity_scores("The manager was pleased with the performance")
# {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.6249}

Pre-built Libraries: Pros and Cons

Advantages:

  • Ready to use with minimal setup
  • Validated on large datasets
  • Handle negations, intensifiers, punctuation
  • Work across many domains

Limitations:

  • Miss domain-specific terminology
  • May not capture specialized expressions
  • Limited contextual understanding
  • One-size-fits-all approach

Approach 1: How can AI help?

AI

  • Doing a sentiment analysis project used to be a fairly advanced analytics job
  • AI now writes code to help

Process

  • Explain the task, show the corpus
  • Iterate on the code structure
  • Needs review and debugging
  • Great help in teaching you key concepts and libraries, and in explaining how they work
  • You have an advantage if you bring code-design knowledge and experience

Approach 2: Domain-Specific Custom Lexicons with AI

Creating a sports-specific lexicon

Prompt:

Review this uploaded corpus. Create a new specific lexicon to analyse sentiments. Assign sentiment scores. Output is a domain_lexicon.csv.

  • The result is a piece of code with new lexicons.
  • You could add more context to the prompt

Approach 2: Domain-Specific Custom Lexicons

Creating a sports-specific lexicon:

sports_positive_terms = [
    'clinical', 'dominated', 'resilient', 'character', 
    'intensity', 'pressing', 'quality', 'fight'
]

sports_negative_terms = [
    'lacked', 'mistakes', 'struggled', 'conceded', 
    'sloppy', 'slow', 'disorganized'
]
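
A minimal sketch of how such a lexicon could be turned into a score; the net-count formula is a simplification (real scoring logic would also handle negation and intensifiers):

def lexicon_score(text, positive, negative):
    """Net sentiment: (positive hits - negative hits) / token count."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

print(lexicon_score("PLAYER was clinical but we conceded sloppy goals.",
                    sports_positive_terms, sports_negative_terms))  # -0.125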

Domain-Specific Lexicons: Pros and Cons

Advantages:

  • Captures domain-specific expressions
  • More accurate for specialized texts
  • Can incorporate domain knowledge
  • Customizable scoring logic

Limitations:

  • Requires manual creation/curation
  • Less generalizable to other domains
  • Needs updates as language evolves
  • Time-consuming to build comprehensively

Approach 3: Machine Learning Models

Types of ML approaches:

  • Traditional ML, deep learning (e.g., CNNs)
  • Transformer models: BERT, RoBERTa – what is used in practice

Advantages:

  • Capture complex patterns and relationships
  • Learn from data rather than rules
  • Better contextual understanding

Limitations:

  • Require labeled training data
  • Domain adaptation challenges
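
A minimal sketch with the Hugging Face transformers pipeline, which downloads a default pretrained sentiment model on first use (assumes transformers and a backend such as PyTorch are installed):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # default pretrained model
print(classifier("In the second half we were really impressive."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]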

Approach 4: AI-Assisted Analysis

Using Large Language Models (LLMs):

  • GPT-4, Claude, etc. for sentiment analysis
  • Feed individual text examples for classification
  • Ask for nuanced, multi-dimensional analysis

Prompt:

Rate the sentiment of this football manager interview on a scale from -2 (very negative) to +2 (very positive):

Provide only the numerical score.

OR

Provide only the numerical score and an explanation. Offer your level of confidence.
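
At scale, the same prompt can be scripted through an LLM API instead of a chat window. A minimal sketch with the OpenAI Python client; the model name is an illustrative assumption:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
interview = ("In the second half, especially after we scored for 1-1, I thought "
             "we were really impressive. We created so many opportunities, good "
             "chances. Then [the result] feels like a disappointment.")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user",
               "content": "Rate the sentiment of this football manager interview "
                          "on a scale from -2 (very negative) to +2 (very positive). "
                          "Provide only the numerical score.\n\n" + interview}],
)
print(response.choices[0].message.content)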

AI-Assisted Analysis: Example

Text:

“In the second half, especially after we scored for 1-1, I thought we were really impressive. We created so many opportunities, good chances. Then [the result] feels like a disappointment.”

LLM Response:

The sentiment score is +1 (somewhat positive).

The coach expresses clear satisfaction with the team’s second-half performance (“really impressive”), highlighting positive elements like creating “many opportunities.” However, there’s disappointment about the final result, creating a mixed but predominantly positive sentiment.

AI-Assisted Analysis: Pros and Cons

Advantages:

  • Sophisticated understanding of context and nuance
  • Adaptable to any domain without retraining
  • Can explain reasoning behind classification

Limitations:

  • Need to prompt sequentially (or use APIs)
  • Potential for inconsistency across analyses
  • Less transparent methodology

Selecting the Right Approach

Consider:

  1. Domain specificity required
  2. Available resources (time, computing, budget)
  3. Transparency needs for your analysis – are reproducibility and stability important?
  4. Scale of text analysis required: 100 vs 1,000 vs 100,000 pieces of text

Hybrid approaches often work best:

  • Start with pre-built libraries but enhance with domain-specific terms for key concepts
  • Compare with AI models (class 06)

Case Study: Football Interview Sentiment

Class 05

  • Look at the text and rate it by hand – compare with AI

Class 06

  • RQ1: How does sentiment vary after a win, a draw, and a loss?
  • RQ2: Is there a difference between men and women coaches (managers)?

Summary

How These Concepts Connect

NLP Pipeline

  1. Raw Text → Tokenization
  2. Tokens → Preprocessing
  3. Clean Tokens → Feature Extraction
  4. Features → Analysis (Sentiment)
  5. Results → Interpretation
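
As a minimal end-to-end sketch of steps 1-4 (tokenization through sentiment, here with VADER; interpretation is left to the analyst):

from collections import Counter

import nltk
for pkg in ("punkt", "stopwords", "vader_lexicon"):
    nltk.download(pkg, quiet=True)
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

text = ("In the second half we were really impressive. "
        "Then it feels like a disappointment.")
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]    # 1. tokenize
clean = [t for t in tokens if t not in stopwords.words("english")]  # 2. preprocess
features = Counter(clean)                                           # 3. features
score = SentimentIntensityAnalyzer().polarity_scores(text)          # 4. sentiment
print(features.most_common(3), score["compound"])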

From Text to Data (Implementation Steps)

Steps

1. Collect Corpus (manager interviews)
2. Preprocess (clean, tokenize, possibly POS-tag)
3. Choose Method (BoW + library, advanced, or LLM-based approach)
4. Analysis (sentiment score 0-1)
5. Discussion (numeric categories)

Key issues

  • Code vs LLM
  • Result versus “truth”

Wrap-Up & Discussion

Key Takeaways:

  1. Text → Data → Insight
  2. BoW is a simple start
  3. Loads of advanced models → modern NLP methods
  4. LLMs bring powerful context