Large Language Models: Key Concepts
2026-01-12
Teaching Data Analysis courses + prepping for 2nd edition of Data Analysis textbook
This is a class to try out different ways to approach a problem
(in progress)
Many great resources available online.
This is the best I have seen:
3Blue1Brown Neural Network series
Assignment: watch them all.
Neural Language Models (2003): First successful application of neural networks to language modeling, establishing the statistical foundations for predicting word sequences based on context.
Word Embeddings (2013): Development of Word2Vec and distributed representations, enabling words to be mapped into vector spaces where semantic relationships are preserved mathematically.
Transformer Architecture (2017): Introduction of the Transformer model with self-attention mechanisms, eliminating sequential computation constraints and enabling efficient parallel processing.
Pretraining + Fine-tuning (2018): With BERT, emergence of the two-stage paradigm where models are first pretrained on vast unlabeled text, then fine-tuned for specific downstream tasks.
ChatGPT (2022): Release of a conversational AI interface that demonstrated unprecedented natural language capabilities to the general public, driving mainstream adoption.
Reinforcement Learning from Human Feedback (2023): Refinement of models through human preferences, aligning AI outputs with human values and reducing harmful responses.
1 token ≈ 4 characters; 4 tokens ≈ 3 words (in English)
The exact ratio varies by model, since each model uses its own tokenizer
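The rule of thumb above can be turned into a quick estimator. This is a minimal sketch that only applies the 4-characters and 4-tokens-per-3-words approximations for English text; it is not a real tokenizer, and the function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text, not an exact value


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate the token count as characters / 4, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(text: str) -> int:
    """Estimate the token count as words * 4 / 3, rounded up."""
    n_words = len(text.split())
    return math.ceil(n_words * 4 / 3)


sample = "Large language models read text as tokens, not as words."
print("by characters:", estimate_tokens_from_chars(sample))
print("by words:     ", estimate_tokens_from_words(sample))
```

The two estimates will usually disagree a little, which is expected: both are approximations, and a model's actual tokenizer is the only exact source.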
ChatGPT (2022): window of about 4,000 tokens
GPT-5.2 (2026): window of 400,000 tokens ≈ a 1,000-page book
Gemini 3 (2026): window of 1 million tokens ≈ roughly the whole Harry Potter series
Tokens matter – more context, more relevant answers
Over the limit: the model may hallucinate or drift off-topic.
Context window = your chat + uploads + retrieved materials
LLMs work much better with knowledge in context window
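To make the budget idea concrete, here is a small sketch that adds up chat, uploads, and retrieved materials and compares the total to a window size. The class and function names are hypothetical, and the 300 words per page figure is an assumption used only to reproduce the "400,000 tokens ≈ 1,000-page book" arithmetic.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token budget: context window = chat + uploads + retrieved materials."""
    window_tokens: int
    chat_tokens: int = 0
    upload_tokens: int = 0
    retrieved_tokens: int = 0

    @property
    def used(self) -> int:
        return self.chat_tokens + self.upload_tokens + self.retrieved_tokens

    @property
    def remaining(self) -> int:
        # Negative when the content no longer fits in the window.
        return self.window_tokens - self.used

    def fits(self) -> bool:
        return self.used <= self.window_tokens


def tokens_to_pages(tokens: int, words_per_page: int = 300) -> float:
    """Convert tokens to pages using 4 tokens ≈ 3 words (English).

    words_per_page=300 is an assumed typical value, not a fixed standard.
    """
    words = tokens * 3 / 4
    return words / words_per_page


small = ContextBudget(window_tokens=4_000, chat_tokens=1_500, upload_tokens=3_000)
print(small.fits(), small.remaining)
print(tokens_to_pages(400_000))
```

With a 4,000-token window, a modest chat plus one uploaded document already overflows, while the same content uses only a small share of a 400,000-token window.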
Think
Context window: grounded knowledge
Outside: good but often vague recollection + internet search
Inference means generating output based on input and learned patterns
More will be covered in Data Analysis with AI 2.
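As a toy illustration of inference, the sketch below "learns" word-pair counts from a tiny made-up corpus and then generates text by repeatedly picking the most frequent next word. Real LLMs use neural networks over tokens rather than word counts, so this is only an analogy for "input + learned patterns → output"; the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram_counts(corpus: str) -> dict:
    """Learned patterns: how often each word follows another."""
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts: dict, prompt_word: str, max_new_words: int = 4) -> str:
    """Inference: extend the prompt with the most likely next word each step."""
    output = [prompt_word]
    for _ in range(max_new_words):
        followers = counts.get(output[-1])
        if not followers:
            break  # nothing learned about this word, so stop
        output.append(followers.most_common(1)[0][0])
    return " ".join(output)


# Made-up training text, for illustration only.
corpus = "the cat sat on the mat . the cat sat on the sofa ."
counts = train_bigram_counts(corpus)
print(generate(counts, "the"))
```

Training happens once (building `counts`); inference is the cheap repeated step of looking up what usually comes next.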
The Centaur and Cyborg approaches, based on Co-Intelligence: Living and Working with AI by Ethan Mollick
Co-Intelligence
Image created with Claude.ai

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
| Stage | Centaur 🧑‍💻 | Cyborg 🦾 |
|---|---|---|
| Plan | 👤 Design research plan<br>🤖 Suggest variables | 👤🤖 Interactive brainstorming<br>👤🤖 Collaborative refinement |
| Data Prep | 👤 Define cleaning rules<br>🤖 Execute cleaning code<br>👤 Validate | 👤🤖 Iterative cleaning<br>👤🤖 Joint discovery and modification |
| Analysis | 👤 Choose methods<br>🤖 Implement code<br>👤 Validate results | 👤🤖 Exploratory conversation<br>👤🤖 Dynamic adjustment<br>👤🤖 Continuous validation |
| Reporting | 👤 Outline findings<br>🤖 Draft sections<br>👤 Finalize | 👤🤖 Co-writing process<br>👤🤖 Real-time feedback<br>👤🤖 Iterative improvement |

Image created by ChatGPT 5.2
| Era | Model | Role of Human |
|---|---|---|
| 2023-24 | Centaur | doer/checker. Half human, half AI. Human writes code, AI fixes it. |
| 2024-25 | Cyborg | integrated. Constant feedback loop. |
| 2026+ | Orchestrator | manager. Human defines intent; agents execute, test, and report. |
Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
Stochastic = when prompted repeatedly, LLMs may give different answers
Parrot = LLMs can repeat information without understanding
Philosophy = to what extent do they understand the state of the world?
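The "stochastic" part can be sketched as sampling from a probability distribution over possible next tokens. The scores below are made-up numbers, not output from any real model, and the function names are hypothetical; temperature scaling followed by softmax is a common way to control how random the choice is.

```python
import math
import random


def softmax_with_temperature(scores: dict, temperature: float) -> dict:
    """Turn raw scores into probabilities; lower temperature = less random."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores.values()]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(scores, exps)}


def sample_next_token(scores: dict, temperature: float, rng: random.Random) -> str:
    """Draw one token at random according to its probability."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


# Hypothetical scores for the next token after some prompt.
scores = {"Paris": 3.0, "Lyon": 1.5, "Berlin": 0.5}
rng = random.Random(0)  # fixed seed so the run is repeatable

print([sample_next_token(scores, 1.0, rng) for _ in range(10)])   # can vary between tokens
print([sample_next_token(scores, 0.01, rng) for _ in range(10)])  # almost always the top token
```

The same prompt can therefore yield different answers across runs, and lowering the temperature makes the output more repeatable without changing what the model has learned.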

Some more ideas
Same as in 2025 Q2
New for 2026 Q1
U.S. Copyright Office, January 2025 report "Copyright and Artificial Intelligence, Part 2: Copyrightability": copyright protection is intended for human-created works.
Note
“[Prompts alone] do not provide sufficient human control to make users of an AI system the authors of the output. Prompts essentially function as instructions that convey unprotectable ideas. While highly detailed prompts could contain the user’s desired expressive elements, at present they do not control how the AI system processes them in generating the output.”
Note
Artificial intelligence software, such as chatbots or other large language models, may not be listed as an author. If artificial intelligence software was used in the preparation of the manuscript, including drafting or editing text, this must be briefly described during the submission process, which will help us understand how authors are or are not using chatbots or other forms of artificial intelligence. Authors are solely accountable for, and must thoroughly fact-check, outputs created with the help of artificial intelligence software.
Two key points from Elsevier's generative AI policies for journals
Source: McKinsey Digital
AI as input, supervision, debugging, responsibility.
Without core knowledge, you can’t interact effectively with the AI
Strong knowledge and experience help with debugging
Gabors Data Analysis with AI - 2026-Q1 v0.5.3