Data Analysis with AI: Concepts

Large Language Models: Key Concepts

Gábor Békés (CEU)

2025-02-18

Intro to the concept of LLMs

Why

  • Teaching Data Analysis courses + prepping for 2nd edition of Data Analysis textbook

  • AI is both an amazing help and scary as #C!*

  • This is a class to

    • discuss and share ideas of use
    • gain experience and confidence
    • find useful use cases
    • learn a bit more about LLMs and their impact
  • Share cool stuff

    • Like this presentation in ‘revealjs’

This class – approach

  • focus on data analysis: code, stats, reporting
  • self-help group to openly discuss experience and trauma
  • get you some experience with selected tasks
  • move from execution as key skill to design and debugging
  • give you a class you can put on your CV

This class – topics

  • Week 1: Effectively summarize academic content
  • Week 2: Document data – World Values Survey (WVS)
  • Week 3: Exploratory data analysis and report creation – WVS
  • Week 4: Data manipulation, wrangling – Football games and teams
  • Week 5: Text analysis and information extraction – Interviews
  • Week 6: Combining text and tabular data, text analysis with APIs (TBD) – Interviews

Evolution of Language Models

Key Breakthrough: “Attention is All You Need” (2017)

  • Introduced the Transformer architecture
  • Eliminated need for sequential processing
  • Enabled massive parallelization
  • Foundation for all modern LLMs (a minimal sketch follows below)
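
A minimal numpy sketch of the scaled dot-product attention at the core of the Transformer; shapes, values, and names are purely illustrative, not any vendor's implementation.

    # Scaled dot-product attention: every token attends to every other token in one
    # matrix product, which removes sequential processing and enables parallelization.
    import numpy as np

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w = w / w.sum(axis=-1, keepdims=True)            # softmax over keys
        return w @ V                                     # weighted average of value vectors

    rng = np.random.default_rng(0)
    Q = K = V = rng.normal(size=(3, 4))                  # toy case: 3 tokens, 4-dimensional embeddings
    print(attention(Q, K, V).shape)                      # (3, 4)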

What are Large Language Models?

  • Statistical models predicting next tokens (toy sketch below)
  • Transform text into mathematical space
  • Scale (training data) matters enormously
  • Pattern recognition at massive scale
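
A toy illustration of next-token prediction with a made-up four-word vocabulary and made-up scores; a real LLM does the same computation over a vocabulary of roughly 100,000 tokens.

    # Turn raw model scores (logits) into a probability for each candidate next token.
    import numpy as np

    vocab  = ["coefficient", "banana", "regression", "model"]
    logits = np.array([2.5, -1.0, 3.0, 2.1])       # illustrative scores after "The estimated ..."

    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                     # softmax: scores -> probabilities

    for token, p in zip(vocab, probs):
        print(f"{token:12s} {p:.2f}")
    print("most likely next token:", vocab[int(np.argmax(probs))])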

LLMs as Prediction Machines

  • Economic Framework: Similar to forecasting models
    • Input → Black Box → Predicted Output
  • Key Difference: Works with unstructured text data
  • Training Process: (Self-)supervised learning at scale – the next token is the label
  • Training Material: “Everything” (all internet + many books)

Understanding Tokens

Context Window & Memory

Note

Larger context = Better understanding but higher computational cost

Token window

  • 1 token ≈ 4 characters; roughly 3 words ≈ 4 tokens (in English) – see the token-counting sketch below

  • ChatGPT (2022): context window of 4,000 tokens.

    • Beyond the limit: the model hallucinates or drifts off-topic.
  • ChatGPT (2025): 128,000 tokens ≈ a 250-page book
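
A quick way to check the rule of thumb above with OpenAI's tiktoken library; the encoding name cl100k_base is an assumption (it is the one used by recent ChatGPT models).

    # Count tokens in a piece of text (pip install tiktoken)
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "Large language models predict the next token from the context window."
    tokens = enc.encode(text)
    print(len(text), "characters ->", len(tokens), "tokens")   # roughly 4 characters per token in English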

Hallucination: Prediction Errors

Type I Error (False Positive)

  • Generating incorrect but plausible information
  • Example: Creating non-existent research citations

Type II Error (False Negative)

  • Failing to generate correct information
  • Example: missing key facts that were in the training data (toy sketch of both error types below)
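
A toy sketch of the two error types, treating verification as a set comparison between what the model generated and an (assumed) list of verified facts; all claims here are made up.

    # Type I: generated but not verified (fabricated). Type II: verified but never generated (missed).
    generated_claims = {"GDP fell in 2009", "Smith (2031) shows X"}        # the second claim is invented
    verified_facts   = {"GDP fell in 2009", "Unemployment peaked in 2010"}

    type_1 = generated_claims - verified_facts    # false positives: plausible but wrong
    type_2 = verified_facts - generated_claims    # false negatives: true facts the model missed
    print("Type I :", type_1)
    print("Type II:", type_2)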

Economic Impact of errors

  • Cost of verification (by humans or AI) and of risk assessment

Reinforcement Learning in LLMs

Key Components

  • RLHF: Reinforcement Learning from Human Feedback
    • Models learn from human preferences
    • Helps align outputs with human values
  • Constitutional AI
    • Models trained to follow specific rules
    • Reduces harmful or biased outputs
  • Direct Preference Optimization
    • Model trained to prefer responses that humans rank higher (loss function sketched below)
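
For reference, the DPO objective from Rafailov et al. (2023): the model π_θ is trained, relative to a frozen reference model π_ref, to put more probability on the human-preferred response y_w than on the rejected response y_l.

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

Here σ is the logistic function and β controls how far the trained model may drift from the reference model.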

Impact on LLMs

  • Better Alignment
    • More helpful responses
    • Reduced harmful content
    • Better instruction following
  • Improved Quality
    • More consistent outputs
    • Better reasoning
    • Clearer explanations

RL Improvements in Claude Development

Key RL Techniques Used

  • Constitutional AI
    • Core part of Claude’s development
    • Helps ensure helpful, safe responses
    • Improves reliability of outputs
  • Direct Preference Optimization (DPO)
    • More efficient than traditional RLHF
    • Reduces training complexity
    • Better alignment with preferences

Observable Improvements

  • Response Quality
    • More nuanced understanding
    • Better reasoning capabilities
    • More consistent outputs
  • Task Performance
    • Improved coding abilities
    • Better at complex analysis
    • More reliable fact adherence

Working with LLMs

Cyborgs vs Centaurs

The Centaur and Cyborg approaches, based on Co-Intelligence: Living and Working with AI by Ethan Mollick

Co-Intelligence

The Jagged Frontier of LLM Capabilities

  • Many tasks could, in principle, be handed to an LLM
  • Uncertainty about how well the LLM will do them – the “Jagged Frontier”
  • Some unexpectedly easy, others surprisingly hard
  • Testing the frontier for data analysis – this class

Image created with Claude.ai

The Centaur Approach

  • Clear division between human and LLM tasks
  • Strategic task allocation based on strengths
  • Human maintains control and oversight
  • LLM used as a specialized tool
  • Quality through specialization
  • Better for high-stakes decisions

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version

The Cyborg Approach

  • Deep integration between human and LLM
  • Continuous interaction and feedback
  • Iterative refinement of outputs
  • Learning from each interaction
  • Faster iteration cycles
  • More creative solutions emerge

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version

Analysis Approaches: Centaur vs Cyborg

Centaur 🧑‍💻

  • Planning: 👤 Design research plan → 🤖 Suggest variables
  • Data Prep: 👤 Define cleaning rules → 🤖 Execute cleaning code → 👤 Validate cleaning
  • Analysis: 👤 Choose methods → 🤖 Implement code → 👤 Validate results
  • Reporting: 👤 Outline findings → 🤖 Draft sections → 👤 Finalize

Cyborg 🦾

  • Planning: 👤🤖 Interactive brainstorming, collaborative refinement
  • Data Prep: 👤🤖 Iterative cleaning, real-time modification, joint discovery
  • Analysis: 👤🤖 Exploratory conversation, dynamic adjustment, continuous validation
  • Reporting: 👤🤖 Co-writing process, real-time feedback, iterative improvement

Practical Guidelines

  1. Start with clear task boundaries (Centaur)
  2. Gradually increase integration (Cyborg)
  3. Many workflows combine both approaches
  4. Higher stakes = more control
  5. Always validate critical outputs
  6. Build experience in prompt engineering 📍 this class

Practical Guidelines (2025-02)

  1. Current LLMs good but not perfect
  2. Hard to fully outsource
  3. Cyborg is the default mode

Future of Data Analysis Workflows

What we see

  • Major gains in coding
  • Some gains elsewhere
  • Enhanced productivity (25–40% gains reported in studies)
  • Focus on human judgment and expertise

What we don’t see

  • Which tasks exactly
  • What the next LLM iteration will improve

Key concepts for Using LLMs

LLM in work

  • Prompt as small task

    • New mindset: having an assistant: design, ask, check
    • 📍 This class Data Analysis related tasks
  • Built into coding

    • GitHub Copilot in VSCode, RStudio, Jupyter Notebook
  • Specialized tools (ChatGPT Canvas, Claude Projects)

  • Anthropic’s “prompt generator” to optimize prompts, available via the Anthropic Console dashboard (click “Generate a Prompt”).

  • Agents

Prompt(ing): 2023–2025

  • In 2023–24, there was a great deal of belief in prompt engineering as a skill
  • In 2025 there are still useful concepts and ideas 📍 Week 2
  • But not many tricks
  • For a highly relevant response, provide the important details and context (example below)
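
An illustrative, made-up example of the last point: the second prompt tells the model the data source, the tool, the exact task, and the constraints.

    Vague:  “Clean my survey data.”
    Richer: “I have the World Values Survey (wave 7) as a CSV. In R (tidyverse), recode the trust
            question into a 0/1 variable, treat negative codes as missing, keep all other columns
            unchanged, comment each step, and show a quick check of the result.”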

Interactive Workspaces for LLM Collaboration (1/2)

Major AI Platforms

  • Anthropic Claude Artifacts

    • Dedicated output window
    • Supports text, code, flowcharts, SVG graphics, websites, dashboards
    • Real-time refinement and modification
    • Sharing and remixing capabilities
  • ChatGPT Canvas

    • Separate collaboration window
    • Text editing and coding capabilities
    • Options for edits, length adjustment, reading level changes
    • Code review and porting features
  • OpenAI Advanced Data Analysis

    • Data upload and analysis
    • Visualization capabilities
    • Python code execution in the back end
    • Error correction and refinement

Interactive Workspaces for LLM Collaboration (2/2)

Specialized Tools

  • Claude Analysis Tool

    • Fast exploratory data analysis
    • Interactive visualizations with real-time adjustments
  • Google NotebookLM

    • Document upload for research grounding
    • Citation and quote provision
    • “Deep dive conversation” podcast generation
  • Microsoft Copilot

    • Assistance in Word, Excel, PowerPoint, etc.
    • Data analysis, formula construction
  • Google Gemini for Workspace

    • Integration with Google’s office suite, assistance in Docs, etc.
  • Cursor AI Code Editor

    • AI-assisted coding
    • Code suggestions and queries, optimization, debugging
    • Real-time collaboration

Stochastic Parrot

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version

Stochastic Parrots

  • Stochastic = when prompted repeatedly, LLMs may give different answers

  • Parrot = LLMs can repeat information without understanding

  • Philosophy = to what extent do they understand the state of the world?

  • List of words often used by LLMs

Data Analysis

  • To what extent does running the same prompt yield the same result? (see the sketch below) 📍 this class
  • How good are the predictions? 📍 this class
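
A minimal sketch of the “stochastic” point, assuming the openai Python package (v1+), an API key in the environment, and an illustrative model name: the same prompt is sent three times, twice with sampling and once with temperature 0.

    from openai import OpenAI

    client = OpenAI()                                   # reads OPENAI_API_KEY from the environment
    prompt = "In one sentence, what does a fixed-effects regression control for?"

    for temperature in (1.0, 1.0, 0.0):                 # two sampled runs, one near-deterministic run
        response = client.chat.completions.create(
            model="gpt-4o-mini",                        # assumed model name
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        print(f"T={temperature}: {response.choices[0].message.content}")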

LLM vs human 2024

  • LLMs were trained on vast amounts of data, much of it low quality
  • “Best” = the most likely answer
  • Humans are trained on much less data, but of higher quality
  • “Best” = learned from experience and from the best people

Big debate on errors and hallucination

The issue is important in medicine

Medical research

Hallucination 2025

  • You can now ask for more “thinking” (reasoning), which leads to less hallucination

ChatGPT convo on tokens

  • Explicitly push the model to do the calculations

ChatGPT convo on tokens 2

LLM vs human 2025

  • LLMs are also trained on scientific papers and books
  • New methods improve accuracy
  • They can solve scientific problems
  • Reasoning models, like OpenAI o3

Ethics and Law

Ethics

AI was created by using (stealing?) human knowledge

Is it Okay to use “Everything” as training material?

AI in research

Use of Artificial Intelligence in AER

Note

Artificial intelligence software, such as chatbots or other large language models, may not be listed as an author. If artificial intelligence software was used in the preparation of the manuscript, including drafting or editing text, this must be briefly described during the submission process, which will help us understand how authors are or are not using chatbots or other forms of artificial intelligence. Authors are solely accountable for, and must thoroughly fact-check, outputs created with the help of artificial intelligence software.

AI use cases

AI Use Cases: Student responses

  1. Coding Assistance & Debugging
    • “It helps me with fixing errors in coding.”, “Find small errors I can’t on my own.”
    • “I used it to generate data in Excel to then work with it in Python and R.”
  2. Concept Clarification & Learning Support
    • “To understand certain topics”, “for clarifications on macroeconomics and data analysis concepts.”
    • “For Micro and Macro courses, to understand graphs easily.”
  3. Writing & Proofreading
    • “I use AI for text and code touch-ups for smoother language.”
    • “While writing papers, I use ChatGPT as proofreader”, “improving the coherence.”
    • “I usually give an idea and AI makes it perfect.”

AI Use Cases: Predictions

  1. Literature Review & Summarization
    AI helps quickly find relevant papers, summarize key arguments, and extract citations, saving time in reviewing large bodies of work.

  2. Data Analysis & Coding Assistance
    AI supports coding in R, Python, and Stata, assisting with debugging, automating repetitive tasks, and suggesting statistical methods for empirical research.

  3. Writing & Editing Support
    AI aids in drafting, structuring, and refining academic writing, improving clarity, grammar, and coherence while maintaining academic integrity.

How I use it?

  • All the time, ChatGPT 4o (Canvas), Claude (Projects), ChatGPT o1 (rare), both paid tiers

  • GitHub Copilot in VSCode and RStudio

  • This presentation is massively helped by AI

    • How can I make a presentation as HTML? → Quarto and revealjs
    • add boxes etc., create the yml, customs.css, add to my website
    • Content on RLHF, tokens
    • Summary slides on cyborg vs centaur

What were your bad experiences with AI?

Topics

  • Background work
  • Coding
  • Discussion of topics, results

To learn more

Readings

To learn more:

Understanding LLMs

Interviews, podcasts

Video Resources

Conclusions and discussion

Many LLMs, constant evolution

Gabor’s current take 1

AI is part of life

Evolution to continue

  • AI tools will improve in precision but will still make errors
  • AI tools will get better at finding bugs but will create new ones
  • Consistency will remain a problem
  • Compute will get cheaper; personal AI will run on phones (not in the cloud)

Gabor’s current take 2

Your place

  • Without core knowledge, you cannot interact with the AI effectively
  • Strong knowledge and experience helps debugging
  • Cheaper data analysis = more use cases

Date stamp

This version: 2025-02-18

bekesg@ceu.edu