Data Analysis with AI course

05 using text as data

Gábor Békés (CEU)

2025-03-24

Motivation

Why Text Analysis?

  • Unlocks unstructured data: Reports, news, social media, interviews
  • Quantifies qualitative information: Sentiment, topics, uncertainty
  • Supplements traditional economic data
  • Real-world applications: Central bank communication, market sentiment, policy analysis

Example: Economic Applications

  • Analyzing Federal Reserve statements to predict market reactions
  • Measuring Economic Policy Uncertainty through news coverage
  • Quantifying sentiment in earnings calls

What Is NLP?

  • Definition: Techniques that help computers understand, interpret, and generate human language.
  • Applications:
    • Sentiment (positive/negative)
    • Topic classification
    • Summarization
  • Examples: analyzing newspaper articles for policy stances or corporate sentiment.
  • NLP is the foundation for many AI text applications (chatbots, search engines, etc.).

Today’s Case Study

“In the second half, especially after we scored for 1-1, I thought we were really impressive. We created so many opportunities, good chances. Then [the result] feels like a disappointment.”

We’ll analyze a post-match interview with a Premier League manager to:

  1. Predict the match result
  2. Analyze sentiment patterns
  3. Determine if the manager attributes the outcome to luck

Sentiment Analysis

Detecting emotion and tone in text

Positive segments:

> “In the second half, especially after we scored for 1-1, I thought we were really impressive.”
> “PLAYER’s finishing is so clinical.”

Negative segments:

> “Then [the result] feels like a disappointment.”
> “We tried to cope with it, but every time we touched them we got a yellow…”

Mixed segments:

> “In the first half we had a lot of problems with their intensity, aggressive playing style…”

Core Concepts in Text Analysis

Plan for this class

  • Different ways to analyze text
  • Turning text into tokens
  • Using AI to analyze the text directly

We’ll cover the following concepts:

  • Tokenization
  • Preprocessing: Stemming vs. Lemmatization
  • Feature Extraction: Bag of Words in Depth

And discuss:

  • What these tools give us
  • Limitations

Bag of words

Corpus: the full text material. Here it is interviews, but it could be a book or a collection of articles.

Bag of words: Text representation that counts word occurrences, ignoring grammar and word order

  • Words → tokens
  • Tweaks (n-grams, weighting)

Tokenization

💡 Tokenization is breaking text into meaningful units (tokens)

Original text from our interview:

> “In the second half, we were really impressive. We created many opportunities.”

Tokenized:

["In", "the", "second", "half", ",", "we", "were", "really", "impressive", ".", "We", "created", "many", "opportunities", "."]

Why it matters: Foundation for all text analysis - determines what counts as a “word”
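
A minimal sketch using NLTK’s word_tokenize (one of several tokenizers; the tokenizer model must be downloaded once):

import nltk
nltk.download("punkt", quiet=True)  # one-time download (newer NLTK versions use "punkt_tab")
from nltk.tokenize import word_tokenize

text = "In the second half, we were really impressive. We created many opportunities."
print(word_tokenize(text))
# ['In', 'the', 'second', 'half', ',', 'we', 'were', 'really', 'impressive', '.', ...]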

Preprocessing: Stemming vs. Lemmatization

💡 Stemming: Algorithmically removes word endings - Fast but sometimes creates non-words

Examples from our interview

  • “playing” → “play”
  • “played” → “play”
  • “impressive” → “impress”
  • “disappointment” → “disappoint”

Purpose: Unifies different forms of the same word to avoid treating them as separate concepts
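
A minimal sketch with NLTK’s PorterStemmer (other stemmers produce slightly different stems):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "played", "impressive", "disappointment"]:
    print(word, "->", stemmer.stem(word))
# playing -> play, played -> play, impressive -> impress, disappointment -> disappoint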

Preprocessing: Stemming vs. Lemmatization

Lemmatization: Converts to dictionary base form - More accurate but computationally intensive

Examples from our interview

  • “were” → “be”
  • “came” → “come”
  • “dominated” → “dominate”
  • “better” → “good”

Purpose: Unifies different variations of the same word to avoid treating them as separate concepts
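
A minimal sketch with NLTK’s WordNetLemmatizer; note that the part of speech must be supplied to get the dictionary base form:

import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("were", pos="v"))       # 'be'
print(lemmatizer.lemmatize("dominated", pos="v"))  # 'dominate'
print(lemmatizer.lemmatize("better", pos="a"))     # 'good'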

Why Preprocessing Matters

Let’s look at our manager’s interview:

Original excerpt:

> “We tried to cope with it, but every time we touched them we got a yellow and that doesn’t really help for us to be intense then as well.”

After preprocessing:

["try", "cope", "touch", "get", "yellow", "help", "intense", "good"]

Benefits:

  • Reduces vocabulary size (computational efficiency)
  • Focuses on meaningful content
  • Improves pattern detection
  • Enables comparisons across texts
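
Putting the pieces together, a minimal preprocessing sketch (lowercase, drop punctuation and stopwords, lemmatize verbs); the exact token list depends on the stopword list and lemmatizer used, so it will differ slightly from the list above:

import nltk
for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = ("We tried to cope with it, but every time we touched them we got "
        "a yellow and that doesn't really help for us to be intense then as well.")
stop = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]  # drop punctuation
clean = [lemmatizer.lemmatize(t, pos="v") for t in tokens if t not in stop]
print(clean)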

Feature Extraction: Bag of Words

Bag of Words Text representation that counts word occurrences, ignoring grammar and word order

Example from our interview:

> “In the first half we had a lot of problems with their intensity, aggressive playing style without the ball – aggressive in a good way.”

Bag of Words representation:

{"first": 1, "half": 1, "lot": 1, "problems": 1, "intensity": 1, "aggressive": 2, "playing": 1, "style": 1, "ball": 1, "good": 1, "way": 1}

What we can learn:

  • Word frequencies (most common terms)
  • Distinctive vocabulary
  • Comparison between documents
  • Simple sentiment indicators
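
A minimal bag-of-words sketch with scikit-learn’s CountVectorizer (assuming scikit-learn is available; English stopwords are dropped, so the counts roughly match the representation above):

from sklearn.feature_extraction.text import CountVectorizer

doc = ("In the first half we had a lot of problems with their intensity, "
       "aggressive playing style without the ball - aggressive in a good way.")
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform([doc])
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0])))
# e.g. {'aggressive': 2, 'ball': 1, 'good': 1, ...}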

Bag of Words: Tweaks

Limitations of Bag of Words

What’s missing from our coach’s interview:

  1. Word order: “We dominated them” vs. “They dominated us”
  2. Context: “not impressive” counts “impressive” as positive
  3. Phrases: “Premier League” becomes “premier” and “league”
  4. Relationships: Between players, actions, and outcomes

Extensions to address limitations:

  • N-grams capture local word order
  • TF-IDF weights words by importance across documents
  • Word embeddings capture semantic relationships

Practical Example: Word Frequency Analysis

Let’s analyze the coach’s interview:

Top 10 most frequent content words:

  1. PLAYER (10 mentions)
  2. half (4 mentions)
  3. played (3 mentions)
  4. well (3 mentions)
  5. intense/intensity (3 mentions)
  6. good (3 mentions)
  7. special (3 mentions)
  8. better (2 mentions)
  9. difficult (2 mentions)
  10. impressive (2 mentions)

What this tells us:

  • Focus on individual player performance
  • Comparison between first and second half
  • Emphasis on quality of play and intensity

N-grams: Adding Context

💡 Bigrams (2-word phrases) from our interview:

  • “good way” (1)
  • “playing style” (1)
  • “special player” (1)

Trigrams (3-word phrases):

  • “played much better” (1)
  • “massive impact game” (1)

Benefits:

  • Captures phrases and contextual relationships
  • Preserves some word order information
  • Identifies common expressions and technical terms
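
A minimal sketch: CountVectorizer’s ngram_range parameter extracts bigrams (or trigrams with (3, 3)); the sentence here is an illustrative fragment:

from sklearn.feature_extraction.text import CountVectorizer

doc = "aggressive playing style in a good way"  # illustrative fragment
bigrams = CountVectorizer(ngram_range=(2, 2))
bigrams.fit([doc])
print(bigrams.get_feature_names_out())
# e.g. ['aggressive playing' 'good way' 'in good' 'playing style' 'style in']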

TF-IDF: Beyond Simple Counts

💡 Term Frequency-Inverse Document Frequency (TF-IDF) Weights words by importance in a document compared to a collection

Example: Comparing our interview with other post-match interviews

Common words in all interviews (low TF-IDF):

  • “game”, “played”, “team”, “half”

Distinctive words in this interview (high TF-IDF):

  • “intensity”, “aggressive”, “clinical”, “impressive”

Benefits:

  • Reveals what makes this text unique
  • Reduces importance of common terms
  • Identifies signature vocabulary and themes
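
A minimal sketch with scikit-learn’s TfidfVectorizer on a few invented interview snippets; the higher weights flag the words distinctive to one document:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

interviews = [                       # hypothetical mini-corpus
    "We played well in the first half of the game.",
    "The team played with intensity and aggressive, clinical pressing.",
    "A tough game, but the team played well in the second half.",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(interviews).toarray()
terms = vectorizer.get_feature_names_out()
top = np.argsort(X[1])[::-1][:4]     # most distinctive terms of interview 2
print([(terms[i], round(X[1][i], 2)) for i in top])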

What we have

  • Bag of words gives us extracted words (tokens)
  • Use 2-grams to capture expressions (“yellow card”)
  • Select important words (e.g., via TF-IDF)

What we need

  • Link words to sentiments via a dictionary
  • Consider context

Sentiment Analysis

Detecting emotion and tone in text

Sentiment

Positive segments:

> “In the second half, especially after we scored for 1-1, I thought we were really impressive.”
> “PLAYER’s finishing is so clinical.”

Negative segments:

> “Then [the result] feels like a disappointment.”
> “We tried to cope with it, but every time we touched them we got a yellow…”

Mixed segments:

> “In the first half we had a lot of problems with their intensity, aggressive playing style…”

Sentiment Analysis Libraries and Approaches

From Bag of Words to Sentiment Analysis

  • BoW tells us what words appear in text
  • Sentiment analysis tells us how those words convey emotion or opinion
  • Connection: Use word frequencies/presence to determine sentiment
  • Challenge: Words have different meanings in different contexts

Approaches to Sentiment Analysis

  1. Pre-built Lexicon Libraries
  2. Domain-Specific Custom Lexicons
  3. Machine Learning Models
  4. AI-Assisted Analysis

Approach 1: Pre-built Lexicon Libraries

Popular libraries: NLTK’s VADER, TextBlob, AFINN, SentiWordNet

Examples


const BASIC_LEXICON = {
  // Positive words
  positive: [
    'good', 'great', 'excellent', 'happy', 'pleased',
    // ... more positive terms elided ...
    'confidence', 'proud', 'amazing', 'fantastic'
  ],
  // Negative words would be listed the same way
  negative: [ /* ... */ ]
};

The code below computes a sentiment score with NLTK’s VADER:

import nltk
nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
sia.polarity_scores("The manager was pleased with the performance")
# {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.6249}

Pre-built Libraries: Pros and Cons

Advantages:

  • Ready to use with minimal setup
  • Validated on large datasets
  • Handle negations, intensifiers, punctuation
  • Work across many domains

Limitations:

  • Miss domain-specific terminology
  • May not capture specialized expressions
  • Limited contextual understanding
  • One-size-fits-all approach

Approach 1: How can AI help?

AI

  • Doing a sentiment analysis project used to be a fairly advanced analytics job
  • AI now writes code to help

Process

  • Explain the task, show the corpus
  • Iterate on the code structure
  • Needs review and debugging
  • Great help in teaching you key concepts and libraries, and in explaining how they work
  • You have an advantage if you bring code-design knowledge and experience

Approach 2: Domain-Specific Custom Lexicons with AI

Creating a sports-specific lexicon

Prompt:

Review this uploaded corpus. Create a new specific lexicon to analyse sentiments. Assign sentiment scores. Output is a domain_lexicon.csv.

  • The result is a piece of code with new lexicons.
  • You could add more context to the prompt

Approach 2: Domain-Specific Custom Lexicons

Creating a sports-specific lexicon:

sports_positive_terms = [
    'clinical', 'dominated', 'resilient', 'character', 
    'intensity', 'pressing', 'quality', 'fight'
]

sports_negative_terms = [
    'lacked', 'mistakes', 'struggled', 'conceded', 
    'sloppy', 'slow', 'disorganized'
]
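
A minimal sketch of how such a lexicon could be turned into a score; the net-count formula is a simplification (real scoring logic would also handle negation and intensifiers):

def lexicon_score(text, positive, negative):
    """Net sentiment: (positive hits - negative hits) / token count."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return (pos - neg) / max(len(tokens), 1)

print(lexicon_score("PLAYER was clinical but we conceded sloppy goals.",
                    sports_positive_terms, sports_negative_terms))  # -0.125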

Domain-Specific Lexicons: Pros and Cons

Advantages:

  • Captures domain-specific expressions
  • More accurate for specialized texts
  • Can incorporate domain knowledge
  • Customizable scoring logic

Limitations:

  • Requires manual creation/curation
  • Less generalizable to other domains
  • Needs updates as language evolves
  • Time-consuming to build comprehensively

Approach 3: Machine Learning Models

Types of ML approaches:

  • Traditional ML, deep learning (e.g., CNNs)
  • Transformer models: BERT, RoBERTa – what is used in practice

Advantages:

  • Capture complex patterns and relationships
  • Learn from data rather than rules
  • Better contextual understanding

Limitations:

  • Require labeled training data
  • Domain adaptation challenges
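
A minimal sketch with the Hugging Face transformers pipeline, which downloads a default pretrained sentiment model on first use (assumes transformers and a backend such as PyTorch are installed):

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # default pretrained model
print(classifier("In the second half we were really impressive."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]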

Approach 4: AI-Assisted Analysis

Using Large Language Models (LLMs):

  • GPT-4, Claude, etc. for sentiment analysis
  • Feed individual text examples for classification
  • Ask for nuanced, multi-dimensional analysis

Prompt:

Rate the sentiment of this football manager interview on a scale from -2 (very negative) to +2 (very positive):

Provide only the numerical score.

OR

Provide only the numerical score and an explanation. Offer your level of confidence.
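
At scale, the same prompt can be scripted through an LLM API instead of a chat window. A minimal sketch with the OpenAI Python client; the model name is an illustrative assumption:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
interview = ("In the second half, especially after we scored for 1-1, I thought "
             "we were really impressive. We created so many opportunities, good "
             "chances. Then [the result] feels like a disappointment.")

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user",
               "content": "Rate the sentiment of this football manager interview "
                          "on a scale from -2 (very negative) to +2 (very positive). "
                          "Provide only the numerical score.\n\n" + interview}],
)
print(response.choices[0].message.content)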

AI-Assisted Analysis: Example

Text:

“In the second half, especially after we scored for 1-1, I thought we were really impressive. We created so many opportunities, good chances. Then [the result] feels like a disappointment.”

LLM Response:

The sentiment score is +1 (somewhat positive).

The coach expresses clear satisfaction with the team’s second-half performance (“really impressive”), highlighting positive elements like creating “many opportunities.” However, there’s disappointment about the final result, creating a mixed but predominantly positive sentiment.

AI-Assisted Analysis: Pros and Cons

Advantages:

  • Sophisticated understanding of context and nuance
  • Adaptable to any domain without retraining
  • Can explain reasoning behind classification

Limitations:

  • Need to prompt sequentially (or use APIs)
  • Potential for inconsistency across analyses
  • Less transparent methodology

Selecting the Right Approach

Consider:

  1. Domain specificity required
  2. Available resources (time, computing, budget)
  3. Transparency needs for your analysis – are reproducibility and stability important?
  4. Scale of text analysis required: 100 vs 1,000 vs 100,000 pieces of text

Hybrid approaches often work best:

  • Start with pre-built libraries but enhance with domain-specific terms for key concepts
  • Compare with AI models (class 06)

Case Study: Football Interview Sentiment

Class 05

  • Look at the text and rate it by hand – compare with AI

Class 06

  • RQ1: How does sentiment vary after a win, a draw, and a loss?
  • RQ2: Is there a difference between men and women coaches (managers)?

Summary

How These Concepts Connect

NLP Pipeline

  1. Raw Text → Tokenization
  2. Tokens → Preprocessing
  3. Clean Tokens → Feature Extraction
  4. Features → Analysis (Sentiment)
  5. Results → Interpretation
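
As a minimal end-to-end sketch of steps 1-4 (tokenization through sentiment, here with VADER; interpretation is left to the analyst):

from collections import Counter

import nltk
for pkg in ("punkt", "stopwords", "vader_lexicon"):
    nltk.download(pkg, quiet=True)
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

text = ("In the second half we were really impressive. "
        "Then it feels like a disappointment.")
tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]    # 1. tokenize
clean = [t for t in tokens if t not in stopwords.words("english")]  # 2. preprocess
features = Counter(clean)                                           # 3. features
score = SentimentIntensityAnalyzer().polarity_scores(text)          # 4. sentiment
print(features.most_common(3), score["compound"])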

From Text to Data (Implementation Steps)

Steps

1. Collect Corpus (manager interviews)
2. Preprocess (clean, tokenize, possibly POS-tag)
3. Choose Method (BoW + library, advanced, or LLM-based approach)
4. Analysis (sentiment score 0-1)
5. Discussion (numeric categories)

Key issues

  • Code vs LLM
  • Result versus “truth”

Wrap-Up & Discussion

Key Takeaways:

  1. Text → Data → Insight
  2. BoW is a simple start
  3. Loads of advanced models → modern NLP methods
  4. LLMs bring powerful context