Large Language Models: Key Concepts
2025-04-21
Teaching Data Analysis courses + prepping for 2nd edition of Data Analysis textbook
AI is both amazing help and scary as #C!*
This is a class to
Try out different ways to approach a problem
This is designed as the first slideshow in a six-week course
All open source at github.com/gabors-data-analysis/da-w-ai
Neural Language Models (2003): First successful application of neural networks to language modeling, establishing the statistical foundations for predicting word sequences based on context.
Word Embeddings (2013): Development of Word2Vec and distributed representations, enabling words to be mapped into vector spaces where semantic relationships are preserved mathematically (a toy sketch follows this timeline).
Transformer Architecture (2017): Introduction of the Transformer model with self-attention mechanisms, eliminating sequential computation constraints and enabling efficient parallel processing.
Pretraining + Fine-tuning (2018): BERT - Emergence of the two-stage paradigm where models are first pretrained on vast unlabeled text, then fine-tuned for specific downstream tasks.
ChatGPT (2022): Release of a conversational AI interface that demonstrated unprecedented natural language capabilities to the general public, driving mainstream adoption.
Reinforcement Learning from Human Feedback (2023): Refinement of models through human preferences, aligning AI outputs with human values and reducing harmful responses.
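To make the embedding idea concrete, here is a minimal sketch, not from the slides, that compares toy word vectors with cosine similarity. The vocabulary and the 4-dimensional vectors are invented for illustration; a real system learns vectors like these with Word2Vec or a similar model.

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- invented for illustration only.
# Real Word2Vec vectors are learned from large corpora and have 100-300 dimensions.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10, 0.05]),
    "queen": np.array([0.78, 0.70, 0.12, 0.50]),
    "apple": np.array([0.05, 0.10, 0.90, 0.30]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 = similar direction/meaning."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for w1, w2 in [("king", "queen"), ("king", "apple")]:
    print(f"{w1:>5} vs {w2:<5}: {cosine_similarity(embeddings[w1], embeddings[w2]):.2f}")
# Semantically related words ("king", "queen") end up closer in the vector space
# than unrelated ones ("king", "apple").
```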
Tokenization example showing how text is processed
Note
Larger context = Better understanding but higher computational cost
1 token ≈ 4 characters; 4 tokens ≈ 3 words (in English)
varies by model
ChatGPT 2022: window of 4,000 tokens
ChatGPT 2025: window of 128,000 tokens ≈ a 250-page book
Tokens matter – more context, more relevant answers
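As a quick check on these rules of thumb, here is a minimal sketch using the open-source tiktoken tokenizer (an assumption: the slides do not prescribe a specific library); it counts characters, words, and tokens for one short English sentence.

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokens matter: more context usually means more relevant answers."
tokens = enc.encode(text)

print(f"{len(text)} characters, {len(text.split())} words -> {len(tokens)} tokens")
print([enc.decode([t]) for t in tokens])  # inspect how the text was split
```

The character-to-token ratio printed here should land close to the 4:1 rule of thumb above.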
The Centaur and Cyborg Approaches based on Co-Intelligence: Living and Working with AI By Ethan Mollick
Co-Intelligence
Image created Claude.ai
Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
Stage | Centaur 🧑💻 | Cyborg 🦾 |
---|---|---|
Plan | 👤 Design research plan 🤖 Suggest variables | 👤🤖 Interactive brainstorming 👤🤖 Collaborative refinement |
Data Prep | 👤 Define cleaning rules 🤖 Execute cleaning code 👤 Validate | 👤🤖 Iterative cleaning 👤🤖 Joint discovery and modification |
Analysis | 👤 Choose methods 🤖 Implement code 👤 Validate results | 👤🤖 Exploratory conversation 👤🤖 Dynamic adjustment 👤🤖 Continuous validation |
Reporting | 👤 Outline findings 🤖 Draft sections 👤 Finalize | 👤🤖 Co-writing process 👤🤖 Real-time feedback 👤🤖 Iterative improvement |
Prompt as a small task (a sketch follows this list)
Built into coding
Specialized tools (ChatGPT Canvas, Claude Projects)
Anthropic’s “prompt generator” to optimize prompts, available via the Anthropic Console dashboard (click “Generate a Prompt”).
Agents
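To illustrate “prompt as a small task”, here is a minimal sketch using the Anthropic Python SDK (an assumption: the slides mention Anthropic tools but do not prescribe this SDK; the model name and the toy classification task are placeholders). The point is the shape of the prompt: one clear instruction, the data inline, an explicit output format.

```python
import anthropic  # pip install anthropic; expects ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()

# A small, well-scoped task: one instruction, the data inline, an explicit output format.
prompt = (
    "You are helping with a data-cleaning step.\n"
    "Classify each company name below as 'bank' or 'not bank'.\n"
    "Return one name and label per line.\n\n"
    "Names: First National Trust, Green Valley Farms, Citywide Savings"
)

message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model alias; use whichever model you have access to
    max_tokens=200,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```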
Workspace | Key Features |
---|---|
Anthropic Claude Artifacts | • Dedicated output window • Supports text, code, flowchart, SVG, website • Real-time refinement and modification • Sharing and remixing capabilities |
ChatGPT Canvas | • Separate collaboration window • Text editing and coding capabilities • Options for edits, length adjustment • Code review and porting features |
OpenAI Advanced Data Analysis | • Data upload and analysis • Visualization capabilities • Python code execution in back end • Error correction and refinement |
Source: Korinek, “Generative AI for Economic Research: Use Cases and Implications for Economists,” Journal of Economic Literature 61(4), December 2024 update, 1–74
Workspace | Key Features |
---|---|
Claude Analysis Tool | • Fast exploratory data analysis • Interactive visualizations with real-time adjustments |
Google NotebookLM | • Document upload for research grounding • Citation and quote provision • “Deep dive conversation” podcast generation |
Microsoft Copilot | • Assistance in Word, Excel, etc. • Data analysis, formula construction |
Google Gemini for Workspace | • Integration with Google’s office suite, Assistance in Docs etc |
Cursor AI Code Editor | • AI-assisted coding • Code suggestions and queries, optimization, debugging • Real-time collaboration |
Source: Korinek, “Generative AI for Economic Research: Use Cases and Implications for Economists,” Journal of Economic Literature 61(4), December 2024 update, 1–74
Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
Stochastic = when prompted repeatedly, LLMs may give different answers (a sampling sketch follows below)
Parrot = LLMs can repeat information without understanding
Philosophy = to what extent do they understand the state of the world?
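Why repeated prompts can give different answers: at each step the model samples the next token from a probability distribution rather than always picking the most likely one. Here is a minimal sketch of temperature-scaled sampling; the vocabulary and the next-token scores are invented for illustration, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng()

# Invented next-token scores; a real model would produce these from the prompt.
vocab = ["rises", "falls", "stagnates", "fluctuates"]
logits = np.array([2.0, 1.5, 0.3, 0.1])

def sample_next_token(logits, temperature=1.0):
    """Softmax with temperature: low T -> near-deterministic, high T -> more varied."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(vocab), p=probs)

for temperature in (0.2, 1.0):
    draws = [vocab[sample_next_token(logits, temperature)] for _ in range(5)]
    print(f"T={temperature}: {draws}")
# At T=0.2 the same token dominates almost every draw; at T=1.0 the completions
# vary from run to run -- the "stochastic" part of the stochastic parrot.
```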
Some more ideas
Literature Review & Summarization
AI helps quickly find relevant papers, summarize key arguments, and extract citations, saving time in reviewing large bodies of work.
Data Analysis & Coding Assistance
AI supports coding in R, Python, and Stata, assisting with debugging, automating repetitive tasks, and suggesting statistical methods for empirical research.
Writing & Editing Support
AI aids in drafting, structuring, and refining academic writing, improving clarity, grammar, and coherence while maintaining academic integrity.
All the time: ChatGPT 4o (Canvas) and Claude (Projects), ChatGPT o1 (rarely); both on paid tiers
GitHub Copilot in VS Code and RStudio
This presentation is massively helped by AI
U.S. Copyright Office, January 2025 report, Copyright and Artificial Intelligence, Part 2: Copyrightability: copyright protection is intended for human-created works.
Note
“[Prompts] do not provide sufficient human control to make users of an AI system the authors of the output. Prompts essentially function as instructions that convey unprotectable ideas. While highly detailed prompts could contain the user’s desired expressive elements, at present they do not control how the AI system processes them in generating the output.”
Note
Artificial intelligence software, such as chatbots or other large language models, may not be listed as an author. If artificial intelligence software was used in the preparation of the manuscript, including drafting or editing text, this must be briefly described during the submission process, which will help us understand how authors are or are not using chatbots or other forms of artificial intelligence. Authors are solely accountable for, and must thoroughly fact-check, outputs created with the help of artificial intelligence software.
Two key points from Elsevier’s generative AI policies for journals
by ChatGPT o3
Feature | o3 (reasoning‑first) | GPT‑4o | GPT 4/4.5 |
---|---|---|---|
Design focus | Built‑in pre‑answer reasoning pass → fewer hallucinations | Multimodal, real‑time latency | General‑purpose LLM; turbo = faster/cheaper |
Multimodality | Text + tool calls | Text + images + audio (I/O) | Text (+ images via plugins) |
Browsing / tools | Auto‑search for up‑to‑date facts | Optional; slower when invoked | Optional |
Default style | Concise, source‑cited | Chatty, demo‑friendly | Flexible, slightly verbose |
Context window | 32k tokens | 128k tokens | 128k (turbo) |
Strengths | Step‑wise analysis, audit trail | Multimodal interaction | Coding & broad knowledge |
Weak spots | No native media I/O | May trade depth for speed | Still hallucination‑prone |
Phase | What o3 auto‑offers | GPT‑4o / 4.5 |
---|---|---|
Data ingest | Reads user‑uploaded CSV / PDF, chooses python or file_search without extra prompting | Explicit instructions (“Please read the file…”) |
Exploration / cleaning | Runs private python for stats, shows clean tables via python_user_visible | Manual code requests; reasoning sometimes mixed in output |
Up‑to‑date context | Self‑initiates web searches when facts may be stale; cites inline | User must ask to search; citation less disciplined |
Reproducibility | Separates analysis vs. user‑visible code ⇒ clear audit trail | Code & commentary often inter‑mingled |
Output control | Defaults to concise bullets & tables; respects Yap score | Tends toward chattier prose unless steered |
Risk mgmt. | Built‑in step‑wise reasoning reduces logic slips in pipelines | More vulnerable to silent errors without “think‑step” prompts |
For data‑analysis projects (e.g., assignments, initial research), o3 itself says it behaves like a careful analyst, leaving GPT‑4o for flashy multimodal demos and 4‑turbo for rapid code generation.
Source: McKinsey Digital
AI as input, supervision, debugging, responsibility.
Without core knowledge you can’t interact effectively
Strong knowledge and experience help with debugging
Gabors Data Analysis with AI - 2025-04-21 v0.4.3