Data Analysis with AI course

05 Appendix to Using Text as Data

Gábor Békés (CEU)

2025-03-24

Appendix

Domain-Specific Lexicons: Implementation

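import re

# Illustrative entries only, not taken from a published lexicon
LEXICON = {"clinical": 2.0, "dominant": 2.0, "disappointed": -2.0, "sloppy": -1.5}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "extremely": 2.0}

def sports_sentiment(text):
    # Tokenize and clean text
    words = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, word in enumerate(words):
        if word not in LEXICON:
            continue
        value = LEXICON[word]
        prev = words[i - 1] if i > 0 else ""
        # Context-aware scoring logic
        if prev in INTENSIFIERS:   # e.g., "very clinical" → strongly positive
            value *= INTENSIFIERS[prev]
        elif prev in NEGATIONS:    # e.g., "not disappointed" → positive
            value = -value
        score += value
    # Sports-specific sentiment score: sum of the adjusted word scores
    return score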

BERT for Sentiment Analysis

Bidirectional Encoder Representations from Transformers
Understands context and word relationships
Pre-trained on a massive text corpus
Fine-tuned for sentiment classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("The team showed great character today.", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# This model predicts a 1-5 star rating; class index 0 corresponds to 1 star
stars = outputs.logits.argmax(dim=-1).item() + 1
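The model call above returns raw logits, one per class (this particular model has five classes, 1 to 5 stars). A minimal sketch of how such logits can be turned into probabilities and a single rating, using only the standard library. The logits below are made up for illustration; with PyTorch, `torch.softmax(outputs.logits, dim=-1)` would do the same job on the real output.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the five classes (1 to 5 stars), for illustration only
logits = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = softmax(logits)

# Most likely class; index 0 corresponds to 1 star
predicted_stars = probs.index(max(probs)) + 1
# Probability-weighted average rating, a smoother alternative to the argmax
expected_stars = sum((i + 1) * p for i, p in enumerate(probs))
```

The weighted average keeps information that the argmax throws away: a text split between 4 and 5 stars scores differently from one that is confidently 5 stars.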

Approach 3: Machine Learning Models

Types of ML approaches:

- Traditional ML, Deep Learning (CNN)
- Transformer Models: BERT, RoBERTa (in practice)

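from transformers import pipeline

# With no model specified, the pipeline downloads a default English
# sentiment model (DistilBERT fine-tuned on SST-2)
classifier = pipeline('sentiment-analysis')
classifier("We were really impressive in the second half")
# Example output: [{'label': 'POSITIVE', 'score': 0.9998}]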
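The classifier returns a label and a confidence score per text. For analysis it is often handier to have one signed number per text. A minimal sketch of that conversion, assuming a two-label POSITIVE/NEGATIVE output format as shown above; the helper name `signed_score` and the example numbers are invented for illustration.

```python
def signed_score(result):
    """Map {'label': 'POSITIVE' or 'NEGATIVE', 'score': p} to a value in [-1, 1]."""
    label = result["label"].upper()
    if label not in ("POSITIVE", "NEGATIVE"):
        raise ValueError(f"unexpected label: {result['label']!r}")
    sign = 1.0 if label == "POSITIVE" else -1.0
    return sign * result["score"]

# Outputs in the classifier's format; the numbers are invented for illustration
results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.9100},
    {"label": "POSITIVE", "score": 0.6000},
]
scores = [signed_score(r) for r in results]
mean_score = sum(scores) / len(scores)  # e.g., average sentiment of one interview
```

Note that the score is the model's confidence in its label, not a measure of how strong the sentiment is, so the signed value is best read as a rough ranking.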

Machine Learning Models: Pros and Cons

Advantages:

- Capture complex patterns and relationships
- Learn from data rather than rules
- Better contextual understanding
- Can improve with more training examples

Limitations:

- Require labeled training data
- Computationally expensive
- “Black box” decision making
- Domain adaptation challenges
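To make "learn from data rather than rules" and "require labeled training data" concrete, here is a minimal sketch of a traditional ML classifier: a bag-of-words Naive Bayes model written with the standard library only. Naive Bayes is one common choice, picked here for brevity, not a method prescribed by the slides. The function names and the four training sentences are invented for illustration; a real model needs far more labeled examples.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label) pairs. Returns the fitted model as a dict."""
    word_counts = defaultdict(Counter)  # label -> word frequencies
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return {"word_counts": word_counts, "label_counts": label_counts, "vocab": vocab}

def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    total_docs = sum(model["label_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_logp = None, -math.inf
    for label, n_docs in model["label_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        logp = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word not in model["vocab"]:
                continue  # ignore words never seen in training
            logp += math.log((counts[word] + 1) / (total_words + vocab_size))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy labeled data, invented for illustration
train = [
    ("great performance from the lads", "positive"),
    ("we were clinical in front of goal", "positive"),
    ("very poor defending today", "negative"),
    ("a disappointing and sloppy display", "negative"),
]
model = train_naive_bayes(train)
prediction = predict(model, "clinical and great")
```

No rule says that "clinical" is praise; the model picks it up from the labels, which is also why the quality and size of the labeled data set the ceiling on performance.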

Using Large Language Models (LLMs):

- GPT-4, Claude, etc. for sentiment analysis
- Feed individual text examples for classification
- Ask for nuanced, multi-dimensional analysis

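def ai_sentiment_analysis(text, llm_api_call, parse_response=float):
    # llm_api_call: placeholder for a function that sends a prompt to an
    #   LLM provider and returns the reply text (provider-specific)
    # parse_response: converts the reply text to a number
    prompt = f"""
    Rate the sentiment of this football manager interview on a
    scale from -2 (very negative) to +2 (very positive):

    "{text}"

    Provide only the numerical score.
    """
    response = llm_api_call(prompt)
    return parse_response(response)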
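LLM replies do not always follow the "only the numerical score" instruction, so the parsing step deserves some care. The slide does not define `parse_response`; the version below is one possible sketch, not the course's implementation. It takes the first number in the reply and checks that it lies on the -2 to +2 scale. The `low` and `high` parameters are assumptions added for illustration.

```python
import re

def parse_response(response, low=-2.0, high=2.0):
    """Extract the first number from an LLM reply and check it is on the scale."""
    # Some models reply with a Unicode minus sign; normalize it to ASCII
    cleaned = response.replace("\u2212", "-")
    match = re.search(r"[-+]?\d+(?:\.\d+)?", cleaned)
    if match is None:
        raise ValueError(f"no numeric score in reply: {response!r}")
    score = float(match.group())
    if not low <= score <= high:
        raise ValueError(f"score {score} outside [{low}, {high}]")
    return score

# Replies of the kind a model might give, invented for illustration
examples = ["+2", "Score: -1", "0.5"]
parsed = [parse_response(r) for r in examples]
```

Raising an error on an unusable reply, rather than silently returning 0, makes it easy to spot and re-run the failed cases.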

Comparison of Approaches

Feature	Pre-built Libraries	Custom Lexicons	ML Models	AI-Assisted
Setup Complexity	Low	Medium	High	Low
Domain Specificity	Low	High	Medium-High	High
Contextual Understanding	Medium	Medium	High	Very High
Transparency	High	High	Low	Medium
Scalability	High	Medium	High	Low-Medium
Cost	Free	Development Time	Computing Resources	API Costs

Selecting the Right Approach

Consider:

Domain specificity required
Available resources (time, computing, budget)
Transparency needs for your analysis
Scale of text analysis required

Hybrid approaches often work best:

Use pre-built libraries for initial analysis
Enhance with domain-specific terms for key concepts
Validate with ML models or AI assistance for edge cases
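The three hybrid steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: `GENERAL` stands in for a pre-built library's lexicon, `DOMAIN` holds the domain-specific overrides, and texts with a weak or missing signal are flagged for a second look by an ML model or an LLM. All names, word scores, and the threshold are hypothetical.

```python
import re

# Hypothetical general-purpose lexicon standing in for a pre-built library
GENERAL = {"great": 1.0, "good": 0.5, "poor": -1.0, "bad": -0.5, "clinical": -0.5}
# Hypothetical domain overrides: in football, "clinical" is praise
DOMAIN = {"clinical": 1.5, "toothless": -1.5}

def hybrid_score(text, threshold=0.5):
    """Return (score, needs_review) for one text."""
    lexicon = {**GENERAL, **DOMAIN}  # domain terms take precedence
    words = re.findall(r"[a-z']+", text.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    score = sum(hits) / len(hits) if hits else 0.0
    # Weak or mixed signal: hand over to an ML model or LLM for validation
    needs_review = abs(score) < threshold
    return score, needs_review

clear_case = hybrid_score("Clinical finishing!")
edge_case = hybrid_score("Great start, poor finish.")
```

Only the flagged texts go to the slower or costlier method, which keeps the approach transparent for most cases and limits API or compute costs.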