05 Appendix to Using Text as Data
2025-03-24
# Sketch of a sports-specific lexicon scorer (lexicon entries are illustrative)
SPORTS_LEXICON = {"clinical": 2, "character": 1, "disappointed": -2, "sloppy": -1}
NEGATIONS = {"not", "never"}
INTENSIFIERS = {"very": 2.0, "extremely": 3.0}

def preprocess(text):
    # Tokenize and clean text
    return [w.strip(".,!?").lower() for w in text.split()]

def sports_sentiment(text):
    words = preprocess(text)
    score = 0.0
    for i, word in enumerate(words):
        # Context-aware scoring: check for negations and intensifiers
        value = SPORTS_LEXICON.get(word, 0)
        if i > 0 and words[i - 1] in NEGATIONS:
            value = -value  # e.g., "not disappointed" → positive
        elif i > 0 and words[i - 1] in INTENSIFIERS:
            value *= INTENSIFIERS[words[i - 1]]  # e.g., "very clinical" → strongly positive
        score += value
    # Sports-specific sentiment score
    return score
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

inputs = tokenizer("The team showed great character today.", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
# This model predicts a 1-5 star rating; take the most likely class
stars = torch.argmax(outputs.logits, dim=-1).item() + 1
Types of ML approaches:
- Traditional ML
- Deep Learning (CNNs)
- Transformer models (BERT, RoBERTa), the usual choice in practice
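The "traditional ML" route can be sketched as TF-IDF features feeding a logistic regression. The snippet below uses scikit-learn (not mentioned in the notes, an assumption) and an invented four-example labeled set purely for illustration:

```python
# Traditional ML sketch: TF-IDF + logistic regression on invented labels
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great character and clinical finishing today",
    "sloppy defending all game",
    "a composed, clinical away win",
    "a disappointing, error-strewn performance",
]
labels = ["pos", "neg", "pos", "neg"]  # hypothetical annotations

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["clinical finishing in the second half"]))
```

In a real project the labeled set would come from human annotation, which is exactly the "require labeled training data" limitation listed below.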
Advantages:
- Capture complex patterns and relationships
- Learn from data rather than rules
- Better contextual understanding
- Can improve with more training examples
Limitations:
- Require labeled training data
- Computationally expensive
- "Black box" decision making
- Domain adaptation challenges
Using Large Language Models (LLMs):
- GPT-4, Claude, etc. for sentiment analysis
- Feed individual text examples for classification
- Ask for nuanced, multi-dimensional analysis
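The LLM route boils down to building a prompt and normalizing the model's free-text reply. A minimal sketch, with an illustrative prompt and label set; the actual API call to GPT-4, Claude, etc. depends on the provider's SDK and is deliberately omitted:

```python
# Sketch of LLM-based sentiment classification: prompt construction plus
# defensive parsing of the reply. The API call itself is provider-specific.
PROMPT = (
    "Classify the sentiment of this sports commentary as positive, "
    "negative, or neutral. Reply with exactly one word.\n\nText: {text}"
)

def build_messages(text):
    # Chat-style message list accepted by most LLM chat APIs
    return [{"role": "user", "content": PROMPT.format(text=text)}]

def parse_label(reply):
    # LLMs often add punctuation or casing; normalize defensively
    label = reply.strip().lower().rstrip(".!")
    return label if label in {"positive", "negative", "neutral"} else "neutral"

print(parse_label("Positive."))  # → positive
```

Constraining the reply format in the prompt and falling back to "neutral" on unexpected output keeps the classification step robust at scale.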
| Feature | Pre-built Libraries | Custom Lexicons | ML Models | AI-Assisted |
|---|---|---|---|---|
| Setup Complexity | Low | Medium | High | Low |
| Domain Specificity | Low | High | Medium-High | High |
| Contextual Understanding | Medium | Medium | High | Very High |
| Transparency | High | High | Low | Medium |
| Scalability | High | Medium | High | Low-Medium |
| Cost | Free | Development Time | Computing Resources | API Costs |
Consider: hybrid approaches often work best.
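One hybrid pattern is to run a cheap, transparent lexicon scorer first and pay for a heavier ML or LLM call only when the lexicon is uncertain. A sketch, where both scorers and the threshold are hypothetical stand-ins:

```python
# Hypothetical hybrid: cheap lexicon pass first, ML fallback when uncertain
def hybrid_sentiment(text, lexicon_score, ml_score, threshold=1.0):
    # lexicon_score and ml_score are caller-supplied scoring functions
    score = lexicon_score(text)
    if abs(score) >= threshold:
        return score          # lexicon is confident enough
    return ml_score(text)     # otherwise fall back to the heavier model

# Usage with toy stand-ins for the two scorers
lex = lambda t: 2.0 if "clinical" in t else 0.0
ml = lambda t: -1.0
print(hybrid_sentiment("a clinical display", lex, ml))  # → 2.0 (lexicon wins)
print(hybrid_sentiment("an even contest", lex, ml))     # → -1.0 (ML fallback)
```

This buys the scalability and transparency of the lexicon column in the table above for easy cases, while reserving expensive contextual models for ambiguous text.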
Gabors Data Analysis with AI - 2025-03-24, v0.1