Large Language Models: Key Concepts
2026-02-16
Teaching Data Analysis courses + prepping for 2nd edition of Data Analysis textbook
This is a class to try out different ways to approach a problem.
(in progress)
Many great resources available online.
This is the best I have seen:
3blue1brown Neural Network series
Assignment: watch them all.
For a full reading list: Beyond: Readings & Resources
Neural Language Models (2003): First successful application of neural networks to language modeling, establishing the statistical foundations for predicting word sequences based on context.
Word Embeddings (2013): Development of Word2Vec and distributed representations, enabling words to be mapped into vector spaces where semantic relationships are preserved mathematically.
Transformer Architecture (2017): Introduction of the Transformer model with self-attention mechanisms, eliminating sequential computation constraints and enabling efficient parallel processing.
Pretraining + Fine-tuning (2018): BERT - Emergence of the two-stage paradigm where models are first pretrained on vast unlabeled text, then fine-tuned for specific downstream tasks.
ChatGPT (2022): Release of a conversational AI interface that demonstrated unprecedented natural language capabilities to the general public, driving mainstream adoption.
Reinforcement Learning from Human Feedback (2023): Refinement of models through human preferences, aligning AI outputs with human values and reducing harmful responses.
1 token ≈ 4 characters; 4 tokens ≈ 3 words (in English; a rough rule of thumb, see the sketch below)
Context window size varies by model and keeps growing
ChatGPT (2022): window of ~4,000 tokens
2026 models: 200k–1M tokens (see Model Comparison)
A 200k-token window holds your entire WVS codebook, data dictionary, and a full conversation about regression specifications
Tokens matter – more context, more relevant answers
Over the limit: the model may hallucinate or drift off-topic.
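A back-of-envelope check in R, using the ~4 characters per token rule of thumb (an approximation, not a real tokenizer):

```r
# Rule-of-thumb token estimate: ~4 characters per token in English text.
estimate_tokens <- function(text) ceiling(nchar(text) / 4)

estimate_tokens("Should I use fixed effects or random effects?")
#> ~12 tokens

# A 500-page codebook at roughly 2,500 characters per page:
ceiling(500 * 2500 / 4)
#> ~312,500 tokens: over a 200k window, fits in a 1M window
```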
Context window = your chat + uploads + retrieved materials
LLMs work much better with knowledge in context window
Think of it this way:
Context window: grounded knowledge
Outside: good but often vague recollection + internet search
Inference means generating output based on input and learned patterns
See Glossary for this and 30+ other technical terms
Reasoning steps are approximate, not logically guaranteed. The model generates plausible reasoning chains — it can still make errors within them.
Think of it as a student showing their work on an exam. The steps look logical, but you still need to check the answer.
Prompt: "I have panel data on firm exports.
Should I use fixed effects or random effects?"
Standard model: "Use fixed effects." (no explanation)
Reasoning model thinks through:
→ "Panel data, firm-level... are firm characteristics
correlated with the independent variable?"
→ "Likely yes — firm size, location are time-invariant
but correlated with export decisions"
→ "Hausman test would confirm, but FE is safer default"
→ Answer: "Fixed effects, because..." (with reasoning)
Warning: The reasoning chain is helpful but not infallible. Always verify the logic — the model may get the econometric reasoning wrong while sounding confident.
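One way to verify the advice yourself: run the Hausman test. A minimal R sketch with the plm package; the data frame df and the variables firm, year, log_exports, firm_size, and tariff are hypothetical stand-ins:

```r
# Minimal sketch: fixed vs. random effects on hypothetical firm-level
# panel data, then a Hausman test to compare the two specifications.
library(plm)

# df is assumed to have columns: firm, year, log_exports, firm_size, tariff
pdata <- pdata.frame(df, index = c("firm", "year"))

fe <- plm(log_exports ~ firm_size + tariff, data = pdata, model = "within")
re <- plm(log_exports ~ firm_size + tariff, data = pdata, model = "random")

# Hausman test: a small p-value favors fixed effects
phtest(fe, re)
```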
More will be covered in Data Analysis with AI 2
Current model details: Which AI?
The Centaur and Cyborg Approaches, based on Co-Intelligence: Living and Working with AI by Ethan Mollick
Co-Intelligence
Image created with Claude.ai

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
| Stage | Centaur 🧑💻 | Cyborg 🦾 |
|---|---|---|
| Plan | 👤 Design research question & identification strategy 🤖 Suggest variables | 👤🤖 Interactive brainstorming 👤🤖 Collaborative refinement |
| Data Prep | 👤 Define cleaning rules 🤖 Execute cleaning code 👤 Validate | 👤🤖 Iterative cleaning 👤🤖 Joint discovery and modification |
| Analysis | 👤 Choose methods 🤖 Implement code 👤 Validate results | 👤🤖 Exploratory conversation 👤🤖 Dynamic adjustment 👤🤖 Continuous validation |
| Reporting | 👤 Outline findings 🤖 Draft sections 👤 Finalize | 👤🤖 Co-writing process 👤🤖 Real-time feedback 👤🤖 Iterative improvement |

Image created by ChatGPT 5.2
| Era | Model | Role of Human |
|---|---|---|
| 2023-24 | Centaur | doer/checker. Half human, half AI. Human writes code, AI fixes it. |
| 2024-25 | Cyborg | integrated. Constant feedback loop. |
| 2026+ | Orchestrator | manager. Human defines intent, Agents execute, test, and report. |
You don’t need to build models. But understanding a few mechanisms helps you use them better.
Three key methods helped LLMs get much better.
See Glossary for RLHF, MoE, and other terms
More in Data Analysis with AI 2
An agent is an AI system that can plan, execute, and iterate on a multi-step task on its own.
Think of it as an RA who can work independently overnight. You give the task in the evening, review the output in the morning.
Key difference from chat: In chat, YOU drive every step. With an agent, the AI drives — you set the goal and review.
You say to an agent: “Analyze the gender wage gap in CPS data”
1. Agent reads your system prompt
→ knows you want R, tidyverse, robust SEs
2. Agent searches for & downloads CPS extract
3. Agent cleans data (drops missing wages, creates
log(wage), experience², education dummies)
4. Agent runs Mincer regression: log(wage) ~ female +
educ + exper + exper²
5. Agent adds controls, checks robustness
6. Agent produces publication-ready table
7. Agent writes a short summary of findings
YOU: Review output, check coefficients make sense, verify sample size, iterate.
The agent does 30 minutes of mechanical work in 2 minutes. Your job shifts from doing to reviewing and directing. This is the Orchestrator model in practice.
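The mechanical core of steps 3–5 fits in a few lines of R. A sketch using dplyr and fixest; the cps data frame and the variable names (wage, female, educ, exper, occ) are hypothetical:

```r
# Sketch of the agent's steps 3-5: clean, construct variables, run the
# Mincer regression with heteroskedasticity-robust standard errors.
library(dplyr)
library(fixest)

cps_clean <- cps |>
  filter(!is.na(wage), wage > 0) |>   # step 3: drop missing wages
  mutate(log_wage = log(wage),
         exper2   = exper^2)          # experience squared

# Step 4: Mincer regression, robust SEs
mincer <- feols(log_wage ~ female + educ + exper + exper2,
                data = cps_clean, vcov = "hetero")

# Step 5 (one robustness check): add occupation fixed effects
mincer_rob <- feols(log_wage ~ female + educ + exper + exper2 | occ,
                    data = cps_clean, vcov = "hetero")

etable(mincer, mincer_rob)            # step 6: comparison table
```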
Multi-agent system:
1. Root Agent
→ Reads your system prompt
→ Breaks down task into sub-tasks
→ Assigns to specialized agents
2. Data Agent
→ Finds CPS data, cleans it, creates variables
3. Analysis Agent
→ Runs regression, checks assumptions
4. Reporting Agent
→ Formats table, writes summary
These agents are coordinated (orchestrated) by the Root Agent, which also manages the context window and ensures each agent has what it needs to do its job.
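A toy R sketch of the orchestration pattern, with plain functions standing in for LLM calls (all names are hypothetical):

```r
# Toy sketch of the multi-agent pattern: the root agent decomposes the
# task and passes each sub-task, plus shared context, to a specialist.
data_agent      <- function(task, ctx) paste("cleaned data for:", task)
analysis_agent  <- function(task, ctx) paste("regression results for:", task)
reporting_agent <- function(task, ctx) paste("report on:", task)

root_agent <- function(task, system_prompt) {
  ctx <- list(system_prompt = system_prompt)  # shared context for all agents
  d <- data_agent(task, ctx)
  a <- analysis_agent(task, ctx)
  reporting_agent(paste(d, a), ctx)
}

root_agent("gender wage gap in CPS", "Use R, tidyverse, robust SEs")
```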
Agent risk: calling a tool (e.g., delete_database) that doesn’t exist, or using it wrongly.
Why does AI cost what it costs?
See Glossary: Tokens
Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
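Providers price per token, and output tokens typically cost more than input tokens. A back-of-envelope calculation in R (the prices are hypothetical; check your provider's current rates):

```r
# Back-of-envelope API cost: per-million-token pricing, with output
# tokens priced higher than input. Prices below are hypothetical.
cost_usd <- function(in_tokens, out_tokens,
                     in_price = 3, out_price = 15) {  # $ per 1M tokens
  in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
}

# A 150k-token codebook upload plus a 2k-token answer:
cost_usd(150000, 2000)
#> ~ $0.48
```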
Stochastic = when prompted repeatedly, LLMs may give different answers
Parrot = LLMs can repeat information without understanding
Philosophy = to what extent do they understand the state of the world?

Same prompt, different day → different output.
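Why stochastic? The model samples each next token from a probability distribution, and a temperature parameter controls how flat that distribution is. A toy R illustration (the vocabulary and logits are made up; real models have vocabularies of ~100k tokens):

```r
# Toy illustration of stochastic next-token sampling with temperature.
vocab  <- c("fixed", "random", "mixed", "pooled")
logits <- c(2.0, 1.5, 0.3, -1.0)

sample_token <- function(logits, temperature = 1) {
  probs <- exp(logits / temperature)
  probs <- probs / sum(probs)          # softmax over the vocabulary
  sample(vocab, 1, prob = probs)
}

replicate(5, sample_token(logits, temperature = 1.0))
# e.g. "fixed" "random" "fixed" "mixed" "fixed"  -- varies run to run
replicate(5, sample_token(logits, temperature = 0.1))
# near-deterministic: almost always "fixed"
```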
AI output looks professional and confident — even when wrong.
What happens to data you paste into AI chat?
See Which AI? — Security for tier comparisons
If the AI disappears tomorrow, can you still do your job? If not, you’re relying too much.
Some more ideas
Same as in 2025 Q2
New for 2026 Q1
AI-centric: Let AI plan, execute, and report. You review and supervise all steps.
Human-centric: You think. AI suggests. Iterate on the plan. Execution is mixed. Full review, multiple rounds likely.
Rule of thumb: If you can’t tell whether the AI output is right or wrong, you shouldn’t be using AI for that task yet.
All major platforms now offer a “deep research” or “research” mode.
U.S. Copyright Office, January 2025 report, Copyright and Artificial Intelligence, Part 2: Copyrightability: copyright protection is intended for human-created works.
Note
“[Prompts] do not provide sufficient human control to make users of an AI system the authors of the output. Prompts essentially function as instructions that convey unprotectable ideas. While highly detailed prompts could contain the user’s desired expressive elements, at present they do not control how the AI system processes them in generating the output.”
Note
Artificial intelligence software, such as chatbots or other large language models, may not be listed as an author. If artificial intelligence software was used in the preparation of the manuscript, including drafting or editing text, this must be briefly described during the submission process, which will help us understand how authors are or are not using chatbots or other forms of artificial intelligence. Authors are solely accountable for, and must thoroughly fact-check, outputs created with the help of artificial intelligence software.
Two key points from Elsevier’s generative AI policies for journals
Source: McKinsey Digital
AI as input, supervision, debugging, responsibility.
Without core knowledge you can’t interact effectively
Strong knowledge and experience help with debugging
Gabors Data Analysis with AI - 2026-Q1 v0.6.2