Large Language Models: Key Concepts
2025-04-16
Teaching Data Analysis courses + prepping for 2nd edition of Data Analysis textbook
AI is both an amazing help and scary as #C!*
This is a class to
Try out different ways to approach a problem
Neural Language Models (2003): First successful application of neural networks to language modeling, establishing the statistical foundations for predicting word sequences based on context.
Word Embeddings (2013): Development of Word2Vec and distributed representations, enabling words to be mapped into vector spaces where semantic relationships are preserved mathematically.
Transformer Architecture (2017): Introduction of the Transformer model with self-attention mechanisms, eliminating sequential computation constraints and enabling efficient parallel processing.
Pretraining + Fine-tuning (2018): BERT - Emergence of the two-stage paradigm where models are first pretrained on vast unlabeled text, then fine-tuned for specific downstream tasks.
ChatGPT (2022): Release of a conversational AI interface that demonstrated unprecedented natural language capabilities to the general public, driving mainstream adoption.
Reinforcement Learning from Human Feedback (2023): Refinement of models through human preferences, aligning AI outputs with human values and reducing harmful responses.
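The word-embedding milestone in the timeline above can be illustrated with a toy example. This is a minimal sketch, not Word2Vec itself: the three-dimensional vectors are hypothetical and hand-picked for illustration, whereas real embeddings are learned from text and have hundreds of dimensions. The function names `cosine_similarity` and `analogy` are likewise assumptions made for this example.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Real embeddings are learned from large text corpora.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    target = [
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    candidates = [w for w in EMBEDDINGS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(target, EMBEDDINGS[w]))


# "king" - "man" + "woman" lands closest to "queen" in this toy space.
print(analogy("king", "man", "woman"))
```

With these made-up vectors the arithmetic recovers "queen"; with learned embeddings the same idea holds only approximately.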
Tokenization example showing how text is processed
Note
Larger context = Better understanding but higher computational cost
1 token ≈ 4 characters; 4 tokens ≈ 3 words (in English)
Varies by model
ChatGPT 2022: window of 4,000 tokens.
ChatGPT 2025: window of 128,000 tokens ≈ a 250-page book
Tokens matter – more context, more relevant answers
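The rules of thumb above can be turned into quick back-of-the-envelope arithmetic. This is a rough sketch that assumes the slide's ratios (4 characters per token, 3 words per 4 tokens, English text); real tokenizers differ by model, and the function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text, varies by model
WORDS_PER_TOKEN = 3 / 4  # 4 tokens are roughly 3 words


def estimate_tokens(text: str) -> int:
    """Rough token count from character length; real tokenizers will differ."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tokens_to_words(tokens: int) -> int:
    """Approximate number of English words that fit in a token budget."""
    return round(tokens * WORDS_PER_TOKEN)


def fits_in_window(text: str, window_tokens: int) -> bool:
    """Check whether the estimated token count fits a context window."""
    return estimate_tokens(text) <= window_tokens


print(tokens_to_words(4_000))          # 3000 words
print(tokens_to_words(128_000))        # 96000 words
print(tokens_to_words(128_000) / 250)  # 384.0 words per page for a 250-page book
```

At about 384 words per page, the 128,000-token window is consistent with the "250-page book" comparison.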
The Centaur and Cyborg Approaches, based on Co-Intelligence: Living and Working with AI by Ethan Mollick
Co-Intelligence
Image created with Claude.ai
Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
Stage | Centaur 🧑💻 | Cyborg 🦾 |
---|---|---|
Plan | 👤 Design research plan 🤖 Suggest variables | 👤🤖 Interactive brainstorming 👤🤖 Collaborative refinement |
Data Prep | 👤 Define cleaning rules 🤖 Execute cleaning code 👤 Validate | 👤🤖 Iterative cleaning 👤🤖 Joint discovery and modification |
Analysis | 👤 Choose methods 🤖 Implement code 👤 Validate results | 👤🤖 Exploratory conversation 👤🤖 Dynamic adjustment 👤🤖 Continuous validation |
Reporting | 👤 Outline findings 🤖 Draft sections 👤 Finalize | 👤🤖 Co-writing process 👤🤖 Real-time feedback 👤🤖 Iterative improvement |
Prompt as small task
Built into coding
Specialized tools (ChatGPT Canvas, Claude Projects)
Anthropic’s “prompt generator” for optimizing prompts, available via the Anthropic Console dashboard (click “Generate a Prompt”).
Agents
Workspace | Key Features |
---|---|
Anthropic Claude Artifacts | • Dedicated output window • Supports text, code, flowchart, SVG, website • Real-time refinement and modification • Sharing and remixing capabilities |
ChatGPT Canvas | • Separate collaboration window • Text editing and coding capabilities • Options for edits, length adjustment • Code review and porting features |
OpenAI Advanced Data Analysis | • Data upload and analysis • Visualization capabilities • Python code execution in back end • Error correction and refinement |
Source: Korinek “Generative AI for Economic Research: Use Cases and Implications for Economists,” Journal of Economic Literature 61(4) December 2024 Update 1–74
Workspace | Key Features |
---|---|
Claude Analysis Tool | • Fast exploratory data analysis • Interactive visualizations with real-time adjustments |
Google NotebookLM | • Document upload for research grounding • Citation and quote provision • “Deep dive conversation” podcast generation |
Microsoft Copilot | • Assistance in Word, Excel, etc. • Data analysis, formula construction |
Google Gemini for Workspace | • Integration with Google’s office suite • Assistance in Docs, etc. |
Cursor AI Code Editor | • AI-assisted coding • Code suggestions and queries, optimization, debugging • Real-time collaboration |
Source: Korinek “Generative AI for Economic Research: Use Cases and Implications for Economists,” Journal of Economic Literature 61(4) December 2024 Update 1–74
Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version
Stochastic = when prompted repeatedly, LLMs may give different answers
Parrot = LLMs can repeat information without understanding
Philosophy = to what extent do they understand the state of the world?
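The "stochastic" point above can be sketched with a toy next-word sampler. This is an illustration under stated assumptions, not how any particular model is implemented: the scores in `NEXT_WORD_LOGITS` are hypothetical, and the `temperature` parameter stands in for the sampling setting that many LLM interfaces expose.

```python
import math
import random

# Hypothetical scores (logits) for the word following
# "The capital of France is"; made up for illustration.
NEXT_WORD_LOGITS = {"Paris": 5.0, "a": 3.0, "located": 2.5, "Lyon": 1.0}


def softmax(logits, temperature=1.0):
    """Turn scores into probabilities; lower temperature sharpens the peak."""
    scaled = {word: score / temperature for word, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - top) for word, score in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def sample_next_word(logits, temperature, rng):
    """Draw one word at random according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only so the demo is repeatable

# Same "prompt", repeated draws: answers can differ from run to run.
print([sample_next_word(NEXT_WORD_LOGITS, 1.0, rng) for _ in range(10)])

# Very low temperature: the most likely word is chosen almost every time.
print([sample_next_word(NEXT_WORD_LOGITS, 0.05, rng) for _ in range(10)])
```

Because each word is drawn from a probability distribution rather than looked up, the same prompt can produce different continuations; lowering the temperature makes the output close to deterministic.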
Literature Review & Summarization
AI helps quickly find relevant papers, summarize key arguments, and extract citations, saving time in reviewing large bodies of work.
Data Analysis & Coding Assistance
AI supports coding in R, Python, and Stata, assisting with debugging, automating repetitive tasks, and suggesting statistical methods for empirical research.
Writing & Editing Support
AI aids in drafting, structuring, and refining academic writing, improving clarity, grammar, and coherence while maintaining academic integrity.
All the time: ChatGPT 4o (Canvas), Claude (Projects), ChatGPT o1 (rarely); both on paid tiers
GitHub Copilot in VS Code and RStudio
This presentation is massively helped by AI
U.S. Copyright Office, January 2025 report Copyright and Artificial Intelligence, Part 2: Copyrightability: copyright protection is intended for human-created works.
Note
“[Prompts alone] do not provide sufficient human control to make users of an AI system the authors of the output. Prompts essentially function as instructions that convey unprotectable ideas. While highly detailed prompts could contain the user’s desired expressive elements, at present they do not control how the AI system processes them in generating the output.”
Note
Artificial intelligence software, such as chatbots or other large language models, may not be listed as an author. If artificial intelligence software was used in the preparation of the manuscript, including drafting or editing text, this must be briefly described during the submission process, which will help us understand how authors are or are not using chatbots or other forms of artificial intelligence. Authors are solely accountable for, and must thoroughly fact-check, outputs created with the help of artificial intelligence software.
Two key points from Elsevier’s generative AI policies for journals
AI as input, supervision, debugging, responsibility.
Without core knowledge you can’t interact
Strong knowledge and experience help with debugging
Gabors Data Analysis with AI - 2025-04-14 v0.4.1