From Tokens to LLMs: how to ‘understand’ textual information. A short intro.
2026-04-21
Language is rich — synonyms, variants, grammar, word order, sarcasm, context.
Classical NLP is about throwing complexity away, carefully:
This is cheating. But it’s honest, well-named cheating: every step has a name, and every name describes what got thrown out.
Part A — the classical pipeline
Tokenize → clean → count → weight. You’ll be able to read a paper that says “we TF-IDF’d a corpus of 10k articles” and know exactly what that means.
Part B — modern tools (the last few slides)
Word embeddings, contextual embeddings (BERT), and LLMs — each preserves more of what the classical pipeline throws away.
Punchline. LLMs (Claude, GPT, Gemini) automate most of Part A for you. Knowing what they automate is what makes you a competent user instead of a cargo-cult one.
“We lost control of the game for five minutes. But my feeling is that the players tried, really tried. They tried during the week and they tried today. In the past, if we had this kind of bad five minutes and we suffered two goals, we didn’t recover. Today is a different feeling.”
— Ruben Amorim, Man United manager, post-match, Nottingham Forest 2–2 Man United, 1 Nov 2025.
Short enough to walk through by hand. Long enough to see every step matter.
Raw text
│
▼
Tokenization ← split into units
│
▼
Normalization ← lowercase, strip punctuation, stopwords
│
▼
Stemming / Lemmatization ← collapse word variants
│
▼
Feature extraction ← Bag of Words, n-grams, TF-IDF
│
▼
Analysis ← counts, classification, LLM scoring
Each step is a choice. Different choices → different numbers → different answers.
Tokenization breaks a string of text into smaller units — tokens. Usually words, sometimes punctuation, sometimes sub-word pieces.
Our first sentence:
“We lost control of the game for five minutes.”
Split on spaces.
Problem: the last token is "minutes." — the period is glued on. "minutes" and "minutes." look different to a computer.
Libraries like NLTK or spaCy handle punctuation.
The period is its own token. Keep it or drop it — your choice.
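The difference between a naive split and a punctuation-aware tokenizer can be sketched in a few lines of plain Python (a stand-in for what NLTK or spaCy do far more carefully):

```python
import re

text = "We lost control of the game for five minutes."

# Naive whitespace split: punctuation stays glued to the word.
naive = text.split()
print(naive[-1])  # 'minutes.'  <- period attached

# A simple punctuation-aware tokenizer: runs of word characters,
# OR any single character that is neither word nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['We', 'lost', 'control', 'of', 'the', 'game', 'for', 'five',
#  'minutes', '.']  -- the period is now its own token
```

Real tokenizers also handle contractions, hyphens, URLs, and emoji; this regex is only the core idea.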
Modern LLMs use sub-word tokenizers like Byte-Pair Encoding (BPE).
"Nottingham" might become "Not" + "tingham". "Amorim" might be 3–4 pieces.
Why it matters:
Rule of thumb (English): 1 token ≈ 0.75 words ≈ 4 characters.
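The rule of thumb turns into a one-line estimator. This is a rough sketch only; exact counts come from the model's own tokenizer (each provider exposes its own counting tool), and the name estimate_tokens is made up here:

```python
def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: about 4 characters per token."""
    return max(1, round(len(text) / 4))

quote = "We lost control of the game for five minutes."
print(estimate_tokens(quote))  # ~11 for this 45-character sentence
```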
"We" and "we" are the same word for most purposes.
When NOT to lowercase:
"United" (club) vs. "united" (adjective). "AMAZING" vs. "amazing".
Otherwise lowercasing is cheap, common, and usually safe.
Stopwords = very common words carrying little standalone meaning: the, of, a, is, we, and.
10 tokens → 5. The gist — “lost control of the game” — survives.
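A sketch with a tiny hand-rolled stopword set (real lists, e.g. nltk.corpus.stopwords, run to 150+ words):

```python
STOPWORDS = {"we", "of", "the", "for", "a", "is", "and"}  # toy list

tokens = ["we", "lost", "control", "of", "the", "game",
          "for", "five", "minutes", "."]

# Keep alphabetic tokens that are not stopwords.
kept = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
print(kept)  # ['lost', 'control', 'game', 'five', 'minutes']
```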
Warning
Stopwords are not always safe to drop.
"not" is a stopword in most lists. "not recover" and "recover" mean opposite things.
Words come in variants:
try, tried, trying, tries
For most counting tasks, you want them as one thing.
A stemmer chops off endings with simple rules. Fast, crude, sometimes produces non-words.
"tried" → "tri". Not a real word — but all forms collapse to the same token.
A lemmatizer uses a dictionary to return the proper base form (the lemma). Slower, more accurate.
Real words: "try", "recover".
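The contrast can be illustrated with a toy suffix-stripper. This is NOT the real Porter algorithm (use nltk's PorterStemmer and WordNetLemmatizer in practice); it is a few made-up rules showing the chop-the-ending idea:

```python
def toy_stem(word: str) -> str:
    """Crude Porter-style suffix stripping -- illustration only."""
    for suffix, repl in (("ied", "i"), ("ies", "i"),
                         ("ing", ""), ("ed", ""), ("es", ""), ("s", "")):
        # Only strip if at least 2 characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + repl
    return word

for w in ("tried", "tries", "trying", "try"):
    print(w, "->", toy_stem(w))
# tried -> tri, tries -> tri, trying -> try, try -> try
```

"tri" is not a real word, but for counting purposes it doesn't matter: the forms collapse together. A lemmatizer would return the dictionary form "try" for all of them.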
| | Stemming | Lemmatization |
|---|---|---|
| Speed | Fast | Slower |
| Output | Sometimes non-words | Real dictionary words |
| Accuracy | Lower | Higher |
| Needs POS tag? | No | Often yes |
| Typical use | Big corpora, search | Research, analysis |
Rule: small research corpus → lemmatize. Billion-doc search → stem.
After tokenizing, lowercasing, dropping punctuation and stopwords, and lemmatizing verbs:
['lose', 'control', 'game', 'five', 'minute', 'feel',
'player', 'try', 'really', 'try', 'try', 'week', 'try',
'today', 'past', 'kind', 'bad', 'five', 'minute',
'suffer', 'two', 'goal', 'recover', 'today',
'different', 'feel']
53 words → 26 tokens.
Already you can see the shape of the text:
Bag of Words (BoW) represents a document as its tokens and their counts — ignoring grammar and order.
Python dict, pandas Series, sparse vector — same idea.
| token | count |
|---|---|
| try | 4 |
| minute | 2 |
| today | 2 |
| feel | 2 |
| five | 2 |
| lose | 1 |
| control | 1 |
| game | 1 |
| player | 1 |
| suffer | 1 |
| goal | 1 |
| recover | 1 |
| different | 1 |
| really | 1 |
| week | 1 |
| past | 1 |
| kind | 1 |
| bad | 1 |
| two | 1 |
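Such a count table is one call to the standard library (a minimal sketch; scikit-learn's CountVectorizer does the same across a whole corpus at once):

```python
from collections import Counter

tokens = ['lose', 'control', 'game', 'five', 'minute', 'feel',
          'player', 'try', 'really', 'try', 'try', 'week', 'try',
          'today', 'past', 'kind', 'bad', 'five', 'minute',
          'suffer', 'two', 'goal', 'recover', 'today',
          'different', 'feel']

bow = Counter(tokens)
print(bow["try"], bow["minute"], bow["feel"])  # 4 2 2
```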
"We lost control" = "control we lost" (same bag). "did not recover" counts "recover" as appearing. "Premier League" becomes two separate tokens. "bad five minutes" and "bad loss" both contribute one "bad".
The next three sections patch one hole each.
An n-gram = a sequence of n consecutive tokens.
From Amorim (cleaned):
("lose", "control"), ("really", "try"), ("bad", "minute"), ("different", "feel")
("try", "really", "try"), ("mental", "strength", "secure")
"not recover" is its own feature, distinct from "recover". "premier league", "manchester united". "under pressure", "on the brink".
Vocabulary explodes.
ngram_range=(1, 2) = unigrams + bigrams. min_df=2 = drop anything in fewer than 2 documents.
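Underneath the sklearn interface, extracting n-grams is just a sliding window over the token list; a minimal sketch:

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens (a sliding window)."""
    return list(zip(*(tokens[i:] for i in range(n))))

cleaned = ["lose", "control", "game", "five", "minute"]
print(ngrams(cleaned, 2))
# [('lose', 'control'), ('control', 'game'),
#  ('game', 'five'), ('five', 'minute')]
```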
BoW and n-grams tell you what’s in a document. They don’t tell you what’s distinctive.
Every post-match article mentions game, half, team, player. Those tell you nothing about a specific article. Noise dressed up as signal.
TF-IDF weights each token by how often it appears in this document, down-weighted by how often it appears across all documents.
IDF = log(N / DF), where N = total number of documents and DF = number of documents containing the word.
Word in every doc: DF = N → IDF = 0 → weight zero. Word in one doc out of 10,000, with several hits in that doc → very high weight.
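The arithmetic in two lines (a sketch of the textbook IDF; sklearn's TfidfVectorizer uses a smoothed variant, so its exact numbers differ slightly):

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Textbook inverse document frequency: log(N / DF)."""
    return math.log(n_docs / doc_freq)

print(idf(10_000, 10_000))       # 0.0  -> appears everywhere, zero weight
print(round(idf(10_000, 1), 2))  # 9.21 -> appears once, heavily boosted
```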
Imagine a corpus of 500 post-match articles.
Low TF-IDF (common across all): game, half, team, player, manager, coach, goal
High TF-IDF (distinctive to the Amorim article): try, recover, mental strength, five minutes, different feeling
TF-IDF does what a human skim-reader does: ignore boilerplate, notice the unusual.
Same interface as CountVectorizer, different weighting.
Sparse features = every token is its own dimension with no built-in relationship.
"manager", "coach", "boss" = three unrelated features. "bank" (finance) = "bank" (river). "we did not, in the end, recover" defeats bigrams.
That’s what the modern tools are for.
Give every word a vector of numbers (typically 100–300 dimensions).
Famous example: king − man + woman ≈ queen. The vector arithmetic actually works on real pre-trained embeddings.
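A toy illustration with invented 2-dimensional vectors (dimension 0 = "royalty", dimension 1 = "maleness"; real embeddings have 100–300 learned dimensions and the relation holds only approximately):

```python
vec = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

# Component-wise: king - man + woman
result = tuple(k - m + w for k, m, w in
               zip(vec["king"], vec["man"], vec["woman"]))
print(result == vec["queen"])  # True: king - man + woman = queen
```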
Two classic methods: Word2Vec and GloVe.
"manager", "coach", "boss" now have close vectors.
The limit:
One vector per word, no matter the sentence. "bank" has one vector — the finance one and the river one get blurred together.
BERT (Devlin et al., Google, 2018) — the Transformer revolution, before LLMs.
"bank" near money vs. "bank" near river → different vectors.
Still the backbone of many production NLP systems — quiet, fast, cheap once trained.
Claude, GPT, Gemini. Generative Transformers with tens to hundreds of billions of parameters.
What’s genuinely new is context awareness.
A hierarchy of how much context each approach sees:
LLMs pick up the things classical pipelines cannot even see:
Project 2: you’ll use Claude to score news articles for how much a manager change was expected — the model reads each article with all this context in play.
| Capability | Classical | Embeddings | LLMs |
|---|---|---|---|
| Speed | very fast | fast | slow |
| Synonyms | no | yes | yes |
| Context | no | some (BERT) | yes |
| World knowledge | no | no | yes |
| Cost | ~free | cheap | $ per call |
| Transparency | high | medium | low |
| Good baseline for | anything | similarity tasks | everything, but not cheap |
When you call an LLM, you skip steps 1–6. That’s fine — but knowing what you skipped helps you reason about why the LLM gave you the answer it did, and where it might fail.
You will not write a tokenizer or a TF-IDF pipeline — you’ll send article text to an LLM and ask for a number.
But the LLM still does all these steps inside.
A Transformer tokenizes (BPE), embeds, weights tokens by importance, averages across layers — at every layer, for every token. It just does them context-aware: no step collapses the text into a bag. Who said what, when, and why travels with every token.
The LLM doesn’t skip the pipeline — it runs a smarter, context-aware version of it for you.
You will need to edit, review, and prompt the LLM to get good scores. Knowing what the LLM is doing under the hood helps you do that effectively.
| Step / approach | What it does | Typical library |
|---|---|---|
| Tokenization | Split text into tokens | nltk, spacy |
| Stopword removal | Drop the, of, is… | nltk.corpus.stopwords |
| Stemming | "trying" → "tri" | PorterStemmer |
| Lemmatization | "trying" → "try" | WordNetLemmatizer, spacy |
| BoW | Count tokens per doc | CountVectorizer |
| N-grams | Count token sequences | CountVectorizer(ngram_range=...) |
| TF-IDF | Count × inverse doc frequency | TfidfVectorizer |
| Word embeddings | Word → vector (similar words close) | gensim, pre-trained GloVe |
| Contextual embeddings | Word → vector (context-dependent) | transformers (BERT models) |
| LLMs | Task directly from a prompt | anthropic, openai, google-genai |
Gabors Data Analysis with AI — NLP Basics, v2.1