Data Analysis with AI course: NLP Basics

From Tokens to LLMs: how to ‘understand’ textual information. A short intro.

Gábor Békés (CEU)

2026-04-21

Research with text

  • Political speeches, earnings calls, interviews, news, policy papers, reviews, tweets.
  • Text is the raw material for a huge share of modern economics, finance, political science, sociology.
  • But you can’t run a regression on a paragraph.
  • To analyze at scale, you have to turn text into numbers.

The simplification trick

Language is rich — synonyms, variants, grammar, word order, sarcasm, context.

Classical NLP is about throwing complexity away, carefully:

  • Keep what matters for your question — frequencies, distinctive terms, broad topic.
  • Drop what you can live without — inflection, casing, exact ordering.

This is cheating. But it’s honest, well-named cheating: every step has a name, and every name describes what got thrown out.

What we’ll cover today

Part A — the classical pipeline

Tokenize → clean → count → weight. You’ll be able to read a paper that says “we TF-IDF’d a corpus of 10k articles” and know exactly what that means.

Part B — modern tools (the last few slides)

Word embeddings, contextual embeddings (BERT), and LLMs — each preserves more of what the classical pipeline throws away.

Punchline. LLMs (Claude, GPT, Gemini) automate most of Part A for you. Knowing what they automate is what makes you a competent user instead of a cargo-cult one.

Our running example

“We lost control of the game for five minutes. But my feeling is that the players tried, really tried. They tried during the week and they tried today. In the past, if we had this kind of bad five minutes and we suffered two goals, we didn’t recover. Today is a different feeling.”

— Ruben Amorim, Man United manager, post-match, Nottingham Forest 2–2 Man United, 1 Nov 2025.

Short enough to walk through by hand. Long enough to see every step matter.

The pipeline

Raw text
   │
   ▼
Tokenization         ← split into units
   │
   ▼
Normalization        ← lowercase, strip punctuation, stopwords
   │
   ▼
Stemming / Lemmatization   ← collapse word variants
   │
   ▼
Feature extraction   ← Bag of Words, n-grams, TF-IDF
   │
   ▼
Analysis             ← counts, classification, LLM scoring

Each step is a choice. Different choices → different numbers → different answers.

What is tokenization?

Tokenization breaks a string of text into smaller units — tokens. Usually words, sometimes punctuation, sometimes sub-word pieces.

Our first sentence:

“We lost control of the game for five minutes.”

Whitespace tokenization

Split on spaces.

text = "We lost control of the game for five minutes."
tokens = text.split()
# ['We', 'lost', 'control', 'of', 'the', 'game',
#  'for', 'five', 'minutes.']

Problem: the last token is "minutes." — the period is glued on. "minutes" and "minutes." look different to a computer.

Word tokenization (a library)

Libraries like NLTK or spaCy handle punctuation.

from nltk.tokenize import word_tokenize

word_tokenize("We lost control of the game for five minutes.")
# ['We', 'lost', 'control', 'of', 'the', 'game',
#  'for', 'five', 'minutes', '.']

The period is its own token. Keep it or drop it — your choice.

Sub-word tokenization (what LLMs use)

Modern LLMs use sub-word tokenizers like Byte-Pair Encoding (BPE).

  • "Nottingham" might become "Not" + "tingham".
  • A rare name like "Amorim" might be 3–4 pieces.

Why it matters:

  • API cost and limits are in tokens, not words.
  • Models handle misspellings and new words without blowing up.
  • You do not tokenize yourself when calling an LLM API.

Rule of thumb (English): 1 token ≈ 0.75 words ≈ 4 characters.
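A minimal sketch of this rule of thumb as a cost estimator. This is a heuristic only; the provider's own tokenizer is authoritative for billing.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic for English: ~4 characters per token.
    return max(1, round(len(text) / 4))

quote = "We lost control of the game for five minutes."
estimate_tokens(quote)  # 11 with this heuristic; a real BPE tokenizer will differ
```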

Lowercasing

"We" and "we" are the same word for most purposes.

[t.lower() for t in tokens]
# ['we', 'lost', 'control', 'of', 'the', 'game',
#  'for', 'five', 'minutes', '.']

When NOT to lowercase:

  • Named entities: "United" (club) vs. "united" (adjective)
  • Sentiment cues: "AMAZING" vs. "amazing"

Removing punctuation

[t for t in tokens if t.isalpha()]
# ['we', 'lost', 'control', 'of', 'the', 'game',
#  'for', 'five', 'minutes']

Cheap, common, usually safe.

Removing stopwords

Stopwords = very common words carrying little standalone meaning: the, of, a, is, we, and.

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

[t for t in tokens if t not in stop]
# ['lost', 'control', 'game', 'five', 'minutes']

10 tokens → 5. The gist — “lost control of the game” — survives.

Stopwords: a warning

Warning

Stopwords are not always safe to drop.

  • "not" is a stopword in most lists.
  • But "not recover" and "recover" mean opposite things.
  • For any analysis that cares about negation: keep stopwords.
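A two-line illustration of the danger, using a hand-rolled toy stopword set (standing in for NLTK's list, which also contains "not"):

```python
stop = {"we", "did", "not"}  # toy stopword set, for illustration only

tokens = ["we", "did", "not", "recover"]
kept = [t for t in tokens if t not in stop]
# ['recover']  -- the negation has vanished
```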

Stemming vs. lemmatization — the problem

Words come in variants:

try, tried, trying, tries

For most counting tasks, you want them as one thing.

Stemming

A stemmer chops off endings with simple rules. Fast, crude, sometimes produces non-words.

from nltk.stem import PorterStemmer
stem = PorterStemmer()

[stem.stem(w) for w in ['tried', 'trying', 'tries',
                        'recover', 'recovered']]
# ['tri', 'tri', 'tri', 'recov', 'recov']

"tried" → "tri". Not a real word — but all forms collapse to the same token.

Lemmatization

A lemmatizer uses a dictionary to return the proper base form (the lemma). Slower, more accurate.

from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()

[lem.lemmatize(w, pos='v') for w in
   ['tried', 'trying', 'tries', 'recover', 'recovered']]
# ['try', 'try', 'try', 'recover', 'recover']

Real words: "try", "recover".

Which to use?

                 Stemming               Lemmatization
Speed            Fast                   Slower
Output           Sometimes non-words    Real dictionary words
Accuracy         Lower                  Higher
Needs POS tag?   No                     Often yes
Typical use      Big corpora, search    Research, analysis

Rule: small research corpus → lemmatize. Billion-doc search → stem.

Applied to the Amorim quote

After tokenizing, lowercasing, dropping punctuation and stopwords, and lemmatizing (nouns and verbs):

['lose', 'control', 'game', 'five', 'minute', 'feel',
 'player', 'try', 'really', 'try', 'try', 'week', 'try',
 'today', 'past', 'kind', 'bad', 'five', 'minute',
 'suffer', 'two', 'goal', 'recover', 'today',
 'different', 'feel']

53 words → 26 tokens.

Already you can see the shape of the text:

  • try × 4
  • minute, today, feel, five × 2 each
  • the words lose, recover, different appear once but carry the story

What is Bag of Words?

Bag of Words (BoW) represents a document as its tokens and their counts — ignoring grammar and order.

Python dict, pandas Series, sparse vector — same idea.
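A minimal sketch with the standard library: a `Counter` over the cleaned tokens from the previous slide is already a bag of words.

```python
from collections import Counter

tokens = ['lose', 'control', 'game', 'five', 'minute', 'feel',
          'player', 'try', 'really', 'try', 'try', 'week', 'try',
          'today', 'past', 'kind', 'bad', 'five', 'minute',
          'suffer', 'two', 'goal', 'recover', 'today',
          'different', 'feel']

bow = Counter(tokens)   # token -> count; order and grammar discarded
bow.most_common(1)      # [('try', 4)]
```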

Amorim as a bag

token count
try 4
minute 2
today 2
feel 2
five 2
lose 1
control 1
game 1
player 1
suffer 1
goal 1
recover 1
different 1

What you can do with a BoW

  • Summarize a document at a glance
  • Compare documents (cosine similarity)
  • Classify documents (logistic regression on BoW is a strong baseline)
  • Track change (this week vs. last)
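The comparison bullet can be made concrete: cosine similarity between two bags, in plain Python (sklearn computes the same quantity on sparse vectors).

```python
import math
from collections import Counter

def cosine(bow_a, bow_b):
    # Dot product over shared tokens, divided by the vector norms.
    dot = sum(bow_a[t] * bow_b.get(t, 0) for t in bow_a)
    norm_a = math.sqrt(sum(c * c for c in bow_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bow_b.values()))
    return dot / (norm_a * norm_b)

a = Counter(['lose', 'control', 'game', 'try', 'try'])
b = Counter(['try', 'recover', 'game'])
cosine(a, b)  # between 0 (no shared tokens) and 1 (identical bags)
```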

What BoW throws away

  1. Word order. "We lost control" = "control we lost" (same bag).
  2. Negation. "did not recover" counts "recover" as appearing.
  3. Multi-word expressions. "Premier League" becomes two separate tokens.
  4. Context. "bad five minutes" and "bad loss" both contribute one "bad".

The next three sections patch one hole each.

What is an n-gram?

An n-gram = a sequence of n consecutive tokens.

  • Bigrams = pairs
  • Trigrams = triples

From Amorim (cleaned):

  • bigrams: ("lose", "control"), ("really", "try"), ("five", "minute"), ("different", "feel")
  • trigrams: ("try", "really", "try"), ("two", "goal", "recover")
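Generating n-grams is a one-liner over the token list; a minimal sketch (sklearn's CountVectorizer does the equivalent internally):

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return list(zip(*(tokens[i:] for i in range(n))))

ngrams(['lose', 'control', 'game'], 2)
# [('lose', 'control'), ('control', 'game')]
```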

What n-grams buy you

  • Local word order: "not recover" is its own feature, distinct from "recover".
  • Phrases: "premier league", "manchester united".
  • Idioms: "under pressure", "on the brink".

The cost

Vocabulary explodes.

  • 10,000 unigrams → easily 200,000 bigrams, most appearing once.
  • You’ll typically cap features by minimum document frequency (e.g., keep only bigrams in ≥ 5 documents).
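The standard fix, sketched in plain Python (this is what CountVectorizer's min_df option does for you):

```python
from collections import Counter

def vocabulary(docs_tokens, min_df=2):
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs_tokens for term in set(doc))
    # Keep only terms appearing in at least min_df documents.
    return {term for term, n in df.items() if n >= min_df}

docs = [['lose', 'control', 'game'],
        ['game', 'try', 'recover'],
        ['try', 'today']]
sorted(vocabulary(docs, min_df=2))
# ['game', 'try']
```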

N-grams in code

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    ngram_range=(1, 2),
    min_df=2,
    stop_words='english',
)
X = vec.fit_transform(documents)

ngram_range=(1, 2) = unigrams + bigrams. min_df=2 = drop anything in fewer than 2 documents.

TF-IDF — the motivation

BoW and n-grams tell you what’s in a document. They don’t tell you what’s distinctive.

Every post-match article mentions game, half, team, player. Those tell you nothing about a specific article. Noise dressed up as signal.

What TF-IDF does

TF-IDF weights each token by how often it appears in this document, down-weighted by how often it appears across all documents.

  • TF (term frequency): count of term t in doc d
  • DF (document frequency): how many docs contain t
  • IDF: log(N / DF) where N = total docs
  • TF-IDF = TF × IDF

Word in every doc: DF = N → IDF = log(N / N) = 0 → weight zero. Word in one doc out of 10,000 with several hits → very high.
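The arithmetic, in the textbook form above. Note that sklearn's TfidfVectorizer uses a smoothed variant, so its numbers differ slightly.

```python
import math

def tf_idf(tf, df, n_docs):
    # tf: count of the term in this document
    # df: number of documents containing the term
    return tf * math.log(n_docs / df)

tf_idf(4, 500, 500)  # term in every doc: weight 0.0
tf_idf(4, 5, 500)    # rare, repeated term: high weight
```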

Applied to our setup

Imagine a corpus of 500 post-match articles.

Low TF-IDF (common across all): game, half, team, player, manager, coach, goal

High TF-IDF (distinctive to the Amorim article): try, recover, mental strength, five minutes, different feeling

TF-IDF does what a human skim-reader does: ignore boilerplate, notice the unusual.

TF-IDF in code

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(
    ngram_range=(1, 2),
    min_df=5,
    stop_words='english',
)
X = vec.fit_transform(documents)

Same interface as CountVectorizer, different weighting.

What the classical pipeline can’t do

Sparse features = every token is its own dimension with no built-in relationship.

  1. No synonyms. "manager", "coach", "boss" = three unrelated features.
  2. No context. "bank" (finance) = "bank" (river).
  3. No world knowledge. Model doesn’t know “two quickfire goals” is bad.
  4. Negation is brittle. "we did not, in the end, recover" defeats bigrams.

That’s what the modern tools are for.

Word embeddings — the idea

Give every word a vector of numbers (typically 100–300 dimensions).

  • Similar words → similar vectors.
  • Geometry encodes meaning.

Famous example: king − man + woman ≈ queen

The vector arithmetic actually works on real pre-trained embeddings.

Two classic methods:

  • Word2Vec (Mikolov et al., Google, 2013)
  • GloVe (Pennington et al., Stanford, 2014)

Word embeddings — what they solve

  • "manager", "coach", "boss" now have close vectors.
  • You can compute similarity between words, sentences, whole documents.
  • Average word vectors in a document → a document vector → surprisingly good features.
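The averaging step, sketched with toy 3-dimensional vectors. These numbers are purely illustrative; real embeddings are 100-300-dimensional and loaded from, e.g., pre-trained GloVe via gensim.

```python
def doc_vector(tokens, emb):
    # Average the vectors of the tokens we have embeddings for;
    # out-of-vocabulary tokens are simply skipped.
    vecs = [emb[t] for t in tokens if t in emb]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

# Toy embeddings, purely illustrative:
emb = {"manager": [0.9, 0.1, 0.3],
       "coach":   [0.8, 0.2, 0.3]}
doc_vector(["the", "manager", "coach"], emb)  # "the" has no vector, so it is skipped
```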

The limit:

One vector per word, no matter the sentence. "bank" has one vector — the finance one and the river one get blurred together.

Contextual embeddings — BERT

BERT (Devlin et al., Google, 2018) — the Transformer revolution, before LLMs.

  • Each word gets a different vector in every sentence it appears in.
  • "bank" near money vs. "bank" near river → different vectors.
  • Pre-trained on massive text, then fine-tuned for specific tasks (sentiment, classification, NER).

Still the backbone of many production NLP systems — quiet, fast, cheap once trained.

LLMs — the current leap

Claude, GPT, Gemini. Generative Transformers with tens to hundreds of billions of parameters.

What’s genuinely new is context awareness.

A hierarchy of how much context each approach sees:

  • Word embeddings — one vector per word. No context (bank = bank).
  • BERT — different vector per sentence. Local context (money-bank vs. river-bank).
  • LLMs — the whole document, the prompt, and the system message all inform every token.

What “context” buys you

LLMs pick up the things classical pipelines cannot even see:

  • Discourse context: “a year to the day since he was appointed” reads as an anniversary / patience-running-out framing.
  • World context: they know Man United have fired two managers in 18 months.
  • Pragmatic context: a calm tone after a bad draw reads as “whistling past the graveyard”, not genuine calm.
  • Negation, irony, comparisons: handled natively, no bigram hacks needed.

Project 2: you’ll use Claude to score news articles for how much a manager change was expected — the model reads each article with all this context in play.

What each approach gives you

Capability          Classical   Embeddings         LLMs
Speed               very fast   fast               slow
Synonyms            no          yes                yes
Context             no          some (BERT)        yes
World knowledge     no          no                 yes
Cost                ~free       cheap              $ per call
Transparency        high        medium             low
Good baseline for   anything    similarity tasks   everything but cheap

Classical tools are NOT obsolete

  • Fast — milliseconds vs. seconds per document.
  • Transparent — you can read every feature.
  • Cheap — no API calls, no GPUs.
  • Strong baseline — often within a few points of expensive models on classification.

When you call an LLM, you skip all of these steps. That’s fine — but knowing what you skipped helps you reason about why the LLM gave you the answer it did, and where it might fail.

Project 2: text → expectation scores

You will not write a tokenizer or a TF-IDF pipeline — you’ll send article text to an LLM and ask for a number.

But the LLM still does all these steps inside.

A Transformer tokenizes (BPE), embeds, and weights tokens by importance (attention), at every layer, for every token. It just does all of this context-aware: no step collapses the text into a bag. Who said what, when, and why travels with every token.

The LLM doesn’t skip the pipeline — it runs a smarter, context-aware version of it for you.

You will need to edit, review, and prompt the LLM to get good scores. Knowing what the LLM is doing under the hood helps you do that effectively.

Quick-reference cheatsheet

Step / approach         What it does                          Typical library
Tokenization            Split text into tokens                nltk, spacy
Stopword removal        Drop the, of, is…                     nltk.corpus.stopwords
Stemming                "trying" → "tri"                      PorterStemmer
Lemmatization           "trying" → "try"                      WordNetLemmatizer, spacy
BoW                     Count tokens per doc                  CountVectorizer
N-grams                 Count token sequences                 CountVectorizer(ngram_range=...)
TF-IDF                  Count × inverse doc frequency         TfidfVectorizer
Word embeddings         Word → vector (similar words close)   gensim, pre-trained GloVe
Contextual embeddings   Word → vector (context-dependent)     transformers (BERT models)
LLMs                    Task directly from a prompt           anthropic, openai, google-genai

Key takeaways

  1. Text → tokens → cleaned tokens → features → analysis.
  2. Each step is a choice. Name the step, own the choice.
  3. Bag of Words is the honest baseline. BoW + n-grams + TF-IDF is often enough.
  4. Embeddings add meaning; BERT adds context; LLMs add world knowledge.
  5. LLMs don’t replace this pipeline — they automate most of it.
  6. Knowing what the classical pipeline does tells you where the LLM might fail.