Data Analysis with AI course

05 Appendix to Using Text as Data

Gábor Békés (CEU)

2025-03-24

Appendix

Domain-Specific Lexicons: Implementation

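# Illustrative mini-lexicons (assumed values; a real lexicon would be far larger)
LEXICON = {"clinical": 2.0, "dominant": 1.5, "disappointed": -1.5, "sloppy": -1.0}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.5, "really": 1.5, "extremely": 2.0}

def sports_sentiment(text):
    # Tokenize and clean text: lowercase, strip punctuation
    words = [w.strip(".,!?;:\"'") for w in text.lower().split()]
    score = 0.0
    for i, word in enumerate(words):
        if word not in LEXICON:
            continue
        value = LEXICON[word]
        prev = words[i - 1] if i > 0 else ""
        # Context-aware scoring: check the preceding word
        if prev in NEGATIONS:
            value = -value               # e.g., "not disappointed" → positive
        elif prev in INTENSIFIERS:
            value *= INTENSIFIERS[prev]  # e.g., "very clinical" → strongly positive
        score += value
    # Sports-specific sentiment score: sum over matched terms
    return score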

BERT for Sentiment Analysis

  • Bidirectional Encoder Representations from Transformers
  • Understands context and word relationships
  • Pre-trained on massive text corpus
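  • Fine-tuned for sentiment classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Model fine-tuned on product reviews; predicts a rating from 1 to 5 stars
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

inputs = tokenizer("The team showed great character today.", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Convert logits to a star rating: class index 0..4 maps to 1..5 stars
stars = outputs.logits.argmax(dim=-1).item() + 1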
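
The model call returns raw logits, one per class, rather than a readable label. A minimal sketch of how such logits could be turned into probabilities and a star rating, assuming five classes that map to 1 to 5 stars (as for the model used here). The logits below are invented for illustration and are not real model output.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def star_summary(logits):
    """Return (most likely star rating, expected star rating) for 5-class logits."""
    probs = softmax(logits)
    best = probs.index(max(probs)) + 1  # class index 0..4 -> 1..5 stars
    expected = sum((i + 1) * p for i, p in enumerate(probs))
    return best, expected

# Hypothetical logits, invented for illustration (not real model output)
example_logits = [-2.1, -1.3, 0.2, 1.9, 1.4]
best, expected = star_summary(example_logits)
print(best, round(expected, 2))
```

The expected rating keeps more information than the single most likely class, which can help when two neighbouring classes are almost equally probable.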

Approach 3: Machine Learning Models

Types of ML approaches:

  • Traditional ML, Deep Learning (CNN)
  • Transformer Models: BERT, RoBERTa – in practice

from transformers import pipeline

classifier = pipeline('sentiment-analysis')
classifier("We were really impressive in the second half")
# [{'label': 'POSITIVE', 'score': 0.9998}]

Machine Learning Models: Pros and Cons

Advantages:

  • Capture complex patterns and relationships
  • Learn from data rather than rules
  • Better contextual understanding
  • Can improve with more training examples

Limitations:

  • Require labeled training data
  • Computationally expensive
  • “Black box” decision making
  • Domain adaptation challenges
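
To make "learn from data rather than rules" concrete, here is a minimal sketch of a traditional ML classifier: a Naive Bayes model written in plain Python. The four training sentences and their labels are invented for illustration; a usable model would need far more labeled data, which is exactly the first limitation listed above.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy labeled data, invented for illustration
train = [
    ("we were clinical and dominant today", "positive"),
    ("great character from the team", "positive"),
    ("a sloppy and disappointing performance", "negative"),
    ("we were poor and made sloppy mistakes", "negative"),
]
model = train_naive_bayes(train)
print(predict("a dominant performance with great character", model))
print(predict("poor and sloppy", model))
```

No sentiment rule is written by hand here: the word weights come entirely from the labeled examples, so changing the training data changes the predictions.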

Using Large Language Models (LLMs):

  • GPT-4, Claude, etc. for sentiment analysis
  • Feed individual text examples for classification
  • Ask for nuanced, multi-dimensional analysis

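def ai_sentiment_analysis(text):
    # llm_api_call and parse_response are placeholders: the first stands for
    # a request to your LLM provider, the second extracts the numeric score
    prompt = f"""
    Rate the sentiment of this football manager interview on a
    scale from -2 (very negative) to +2 (very positive):
    
    "{text}"
    
    Provide only the numerical score.
    """
    response = llm_api_call(prompt)
    return parse_response(response)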
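
The snippet leaves parse_response undefined. One possible sketch, assuming the reply is a plain string and that anything outside the requested -2 to +2 scale should be treated as a failed answer. The default bounds and the choice to return None on failure are assumptions, not part of the original code.

```python
import re

def parse_response(response, low=-2.0, high=2.0):
    """Return the first number in an LLM reply as a float, or None if missing or out of range."""
    match = re.search(r"[-+]?\d+(?:\.\d+)?", response)
    if match is None:
        return None  # the model did not return a number
    score = float(match.group())
    if not low <= score <= high:
        return None  # outside the scale requested in the prompt
    return score

print(parse_response("+1"))
print(parse_response("Score: -2"))
print(parse_response("I cannot rate this text."))
print(parse_response("7"))
```

Because only the first number is read, a reply that restates the scale before giving its answer would be parsed incorrectly. That is one reason the prompt asks for the numerical score only.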

Comparison of Approaches

Feature                  | Pre-built Libraries | Custom Lexicons  | ML Models           | AI-Assisted
-------------------------|---------------------|------------------|---------------------|------------
Setup Complexity         | Low                 | Medium           | High                | Low
Domain Specificity       | Low                 | High             | Medium-High         | High
Contextual Understanding | Medium              | Medium           | High                | Very High
Transparency             | High                | High             | Low                 | Medium
Scalability              | High                | Medium           | High                | Low-Medium
Cost                     | Free                | Development Time | Computing Resources | API Costs

Selecting the Right Approach

Consider:

  1. Domain specificity required
  2. Available resources (time, computing, budget)
  3. Transparency needs for your analysis
  4. Scale of text analysis required

Hybrid approaches often work best:

  • Use pre-built libraries for initial analysis
  • Enhance with domain-specific terms for key concepts
  • Validate with ML models or AI assistance for edge cases
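
The three hybrid steps above can be sketched as follows. This is a minimal illustration, assuming a simple word-sum lexicon score and a threshold below which the text counts as an edge case. The word lists, the threshold of 0.5, and dummy_fallback (a stand-in for an ML model or LLM call) are all invented for this example.

```python
# Toy word lists, invented for illustration
GENERAL_LEXICON = {"great": 1.0, "good": 0.5, "poor": -1.0, "bad": -1.0}
SPORTS_TERMS = {"clinical": 1.5, "sloppy": -1.0}  # domain-specific additions

def lexicon_score(text, lexicon):
    """Sum the scores of all known words in the text."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(lexicon.get(w, 0.0) for w in words)

def hybrid_sentiment(text, fallback, threshold=0.5):
    """Return (score, source): the lexicon score if decisive, else the fallback's score."""
    lexicon = {**GENERAL_LEXICON, **SPORTS_TERMS}
    score = lexicon_score(text, lexicon)
    if abs(score) >= threshold:
        return score, "lexicon"
    return fallback(text), "fallback"  # edge case: no clear lexicon signal

def dummy_fallback(text):
    """Stand-in for an ML model or LLM call; always returns neutral."""
    return 0.0

print(hybrid_sentiment("We were clinical in front of goal.", dummy_fallback))
print(hybrid_sentiment("It was a game of two halves.", dummy_fallback))
```

Routing only the undecided texts to the more expensive model keeps most of the analysis cheap and transparent while still covering the cases the lexicon cannot resolve.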