Data Analysis with AI: Concepts

Large Language Models: Key Concepts

Gábor Békés (CEU)

2025-04-16

Intro to the concept of LLMs

Use of Artificial Intelligence

Why

  • Teaching Data Analysis courses + prepping for 2nd edition of Data Analysis textbook

  • AI is both amazing help and scary as #C!*

  • This is a class to

    • discuss and share ideas of use
    • gain experience and confidence
    • find useful use cases
    • learn a bit more about LLMs and their impact
  • Try out different ways to approach a problem

    • One prompt vs interaction
    • Compare human vs machine understanding of a text

This class – approach

  • focus on data analysis steps: research question, code, statistics, reporting
  • self-help group to openly discuss experience and trauma
  • get you some experience with selected tasks
  • move from execution as key skill to design and debugging
  • get you a class you can put on your CV
  • (extra) talk about topics I care about in data analysis

This class – topics and case studies

  • Week 1: Review LLMs – An FT graph
  • Week 2: EDA and data documentation – World Values Survey (WVS)
  • Week 3: Analysis and report creation – World Values Survey (WVS)
  • Week 4: Data manipulation, wrangling – Synthetic Hotels
  • Week 5: Text analysis and information extraction – Post-match interviews
  • Week 6: Different ways of sentiment analysis – Post-match interviews

LLM Development Timeline

Key Milestones in LLM Development I

  • Neural Language Models (2003): First successful application of neural networks to language modeling, establishing the statistical foundations for predicting word sequences based on context.

  • Word Embeddings (2013): Development of Word2Vec and distributed representations, enabling words to be mapped into vector spaces where semantic relationships are preserved mathematically.

  • Transformer Architecture (2017): Introduction of the Transformer model with self-attention mechanisms, eliminating sequential computation constraints and enabling efficient parallel processing.
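The embeddings milestone can be illustrated with a toy example: a minimal sketch, using made-up 3-dimensional vectors (not real Word2Vec output), of how cosine similarity captures semantic relatedness in a vector space:

```python
import math

# Toy 3-dimensional "embeddings" — illustrative made-up numbers, not real
# Word2Vec vectors — chosen so related words point in similar directions.
emb = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words end up closer in the vector space:
print(cosine(emb["king"], emb["queen"]))  # high, ~0.99
print(cosine(emb["king"], emb["apple"]))  # low, ~0.31
```

Real embeddings have hundreds of dimensions and are learned from co-occurrence patterns in huge corpora, but the geometry works the same way.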

Key Milestones in LLM Development II

  • Pretraining + Fine-tuning (2018): BERT - Emergence of the two-stage paradigm where models are first pretrained on vast unlabeled text, then fine-tuned for specific downstream tasks.

  • ChatGPT (2022): Release of a conversational AI interface that demonstrated unprecedented natural language capabilities to the general public, driving mainstream adoption.

  • Reinforcement Learning from Human Feedback (2023): Refinement of models through human preferences, aligning AI outputs with human values and reducing harmful responses.

References

  • [1]: Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). “A Neural Probabilistic Language Model.” Journal of Machine Learning Research.
  • [2]: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” Advances in Neural Information Processing Systems.
  • [3]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems.
  • [4]: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint.
  • [5]: OpenAI. (2022). “ChatGPT: Optimizing Language Models for Dialogue.” OpenAI Blog.
  • [6]: Anthropic. (2023). “Constitutional AI: Harmlessness from AI Feedback.” arXiv preprint.

Key Milestones in LLM Development III

  • New ideas on using synthetic data to train models
  • DeepSeek’s cheaper approach
  • Agentic AI

What are Large Language Models?

  • Statistical models predicting next tokens
  • Transform text into mathematical space
  • Scale (training data) matters enormously
  • Pattern recognition at massive scale

LLMs as Prediction Machines

  • Economic Framework: Similar to forecasting models
    • Input → Black Box → Predicted Output
  • Key Difference: Works with unstructured text data
  • Training Process: Supervised learning at scale
  • Training Material: “Everything” (all internet + many books)
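The "prediction machine" view can be made concrete with the simplest possible language model — a bigram model that predicts the next word from counts in a toy corpus. LLMs do the same thing at vastly larger scale, with neural networks in place of a count table:

```python
from collections import Counter, defaultdict

# A tiny toy corpus; real LLMs train on trillions of tokens.
corpus = "the cat sat on the mat and the cat slept".split()

# Count how often each word follows each context word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = follows[word]
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total

print(predict_next("the"))  # "cat" follows "the" in 2 of 3 cases
```

The "black box" of an LLM replaces these raw counts with learned parameters, and conditions on a long context rather than a single previous word.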

Understanding Tokens

Context Window & Memory

Note

Larger context = Better understanding but higher computational cost

Token window

  • 1 token ≈ 4 characters, 4 tokens ≈ 3 words (in English)

  • varies by model

  • ChatGPT 2022: window of 4,000 tokens

  • ChatGPT 2025: window of 128,000 tokens ≈ a 250-page book

  • Tokens matter – more context, more relevant answers

    • Over the limit: hallucination, off-topic answers
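The rules of thumb above allow quick back-of-the-envelope arithmetic; a sketch assuming ~4 characters per token and ~400 words per book page (both rough heuristics, not exact tokenizer output):

```python
WORDS_PER_TOKEN = 3 / 4   # rule of thumb: 4 tokens ≈ 3 English words
WORDS_PER_PAGE = 400      # rough assumption for a typical book page

def estimate_tokens(text: str) -> int:
    """Rough token count from character length (~4 characters per token)."""
    return round(len(text) / 4)

def context_window_in_pages(n_tokens: int) -> float:
    """Convert a context window size into an approximate book-page count."""
    return n_tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE

print(estimate_tokens("Data analysis with AI"))  # ~5 tokens
print(context_window_in_pages(128_000))          # 240 pages, i.e. roughly a 250-page book
</imports>```

Actual tokenizers split text by learned subword rules, so real counts differ somewhat from this heuristic.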

Hallucination: Prediction Errors

Type I Error (False Positive)

  • Generating incorrect but plausible information
  • Example: Creating non-existent research citations

Type II Error (False Negative)

  • Failing to generate correct information
  • Example: Missing key facts in training data
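A toy sketch of counting the two error types; the claims and labels below are entirely made up for illustration, not real model outputs:

```python
# Each item: (claim, model_generated_it, claim_is_actually_true)
checked = [
    ("GDP figures match the World Bank data", True,  True),   # correct output
    ("Smith (2019) on hotel pricing",         True,  False),  # Type I: plausible but invented
    ("Key caveat stated in the source",       False, True),   # Type II: true fact omitted
]

# Type I error: the model generated a claim that is false.
type_1 = sum(1 for _, generated, true in checked if generated and not true)
# Type II error: a true, relevant fact the model failed to produce.
type_2 = sum(1 for _, generated, true in checked if not generated and true)

print(type_1, type_2)  # 1 fabricated claim, 1 missed fact
```

The economic point follows directly: every generated claim needs verification, and every omission needs domain knowledge to notice.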

Economic Impact of errors

  • Cost of verification (humans, AI), risk assessment

Reinforcement Learning in LLMs

Key Components

  • RLHF: Reinforcement Learning from Human Feedback
    • Models learn from human preferences
    • Helps align outputs with human values
  • Constitutional AI
    • Models trained to follow specific rules
    • Reduces harmful or biased outputs
  • Direct Preference Optimization
    • Model trained to prefer responses that humans rank higher
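For intuition on DPO, here is a minimal sketch of its standard loss: the model is rewarded when it prefers the human-chosen response more strongly than a frozen reference model does. The log-probabilities below are made up for illustration:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy log-ratio gap vs. the reference model))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))

# Made-up log-probabilities: the loss falls as the policy prefers the
# human-chosen response more strongly than the reference model does.
weak   = dpo_loss(logp_chosen=-5.0, logp_rejected=-5.0, ref_chosen=-5.0, ref_rejected=-5.0)
strong = dpo_loss(logp_chosen=-3.0, logp_rejected=-7.0, ref_chosen=-5.0, ref_rejected=-5.0)
print(weak, strong)  # loss shrinks as the preference margin grows
```

Unlike RLHF, no separate reward model is trained — the preference data enters the loss directly, which is why DPO reduces training complexity.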

Impact on LLMs

  • Better Alignment
    • More helpful responses
    • Reduced harmful content
    • Better instruction following
  • Improved Quality
    • More consistent outputs
    • Better reasoning
    • Clearer explanations

RL Improvements in Claude Development

Key RL Techniques Used

  • Constitutional AI
    • Core part of Claude’s development
    • Helps ensure helpful, safe responses
    • Improves reliability of outputs
  • Direct Preference Optimization (DPO)
    • More efficient than traditional RLHF
    • Reduces training complexity
    • Better alignment with preferences

Observable Improvements

  • Response Quality
    • More nuanced understanding
    • Better reasoning capabilities
    • More consistent outputs
  • Task Performance
    • Improved coding abilities
    • Better at complex analysis
    • More reliable fact adherence

What’s new (2025-04): AI as teammate

  • AI as teammate
  • Have a group of people work with AI as a teammate, including it in discussions, etc.
    • Started already with medical teams

Working with LLMs

Cyborgs vs Centaurs

The Centaur and Cyborg approaches, based on Co-Intelligence: Living and Working with AI by Ethan Mollick

Co-Intelligence

The Jagged Frontier of LLM Capabilities

  • many tasks could plausibly be handed over to an LLM
  • uncertainty about how well the LLM will do them – the “Jagged Frontier”
  • some are unexpectedly easy, others surprisingly hard
  • testing the frontier for data analysis – this class

Image created with Claude.ai

The Centaur Approach

  • Clear division between human and LLM tasks
  • Strategic task allocation based on strengths
  • Human maintains control and oversight
  • LLM used as a specialized tool
  • Quality through specialization
  • Better for high-stakes decisions

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version

The Cyborg Approach

  • Deep integration between human and LLM
  • Continuous interaction and feedback
  • Iterative refinement of outputs
  • Learning from each interaction
  • Faster iteration cycles
  • More creative solutions emerge

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version

Analysis Approaches: Centaur vs Cyborg

| Stage | Centaur 🧑‍💻 | Cyborg 🦾 |
|-----------|--------------------------------------------------------------------|------------------------------------------------------------------------------------|
| Plan | 👤 Design research plan · 🤖 Suggest variables | 👤🤖 Interactive brainstorming · 👤🤖 Collaborative refinement |
| Data Prep | 👤 Define cleaning rules · 🤖 Execute cleaning code · 👤 Validate | 👤🤖 Iterative cleaning · 👤🤖 Joint discovery and modification |
| Analysis | 👤 Choose methods · 🤖 Implement code · 👤 Validate results | 👤🤖 Exploratory conversation · 👤🤖 Dynamic adjustment · 👤🤖 Continuous validation |
| Reporting | 👤 Outline findings · 🤖 Draft sections · 👤 Finalize | 👤🤖 Co-writing process · 👤🤖 Real-time feedback · 👤🤖 Iterative improvement |

Practical Guidelines

  1. Start with clear task boundaries (Centaur)
  2. Gradually increase integration (Cyborg)
  3. Many workflows combine both approaches
  4. Higher stakes = more control
  5. Always validate critical outputs
  6. Build experience in prompt engineering 📍 this class

Practical Guidelines (2025-04)

  1. Current LLMs good but not perfect
  2. Hard to fully outsource
  3. Cyborg is the default mode
  4. AI as teammate is emerging

Future of Data Analysis Workflows

What we see

  • Major gains in coding
  • Some gains elsewhere
  • Enhanced productivity (25-40% shown in studies)
  • Focus on human judgment and expertise

What we don’t see

  • Which tasks exactly
  • What the next LLM iteration will improve

Key concepts for Using LLMs

LLM in work

  • Prompt as small task

    • New mindset: having an assistant: design, ask, check
    • 📍 This class Data Analysis related tasks
  • Built into coding

    • Github copilot in VSCode, RStudio, Jupyter Notebook
  • Specialized tools (ChatGPT Canvas, Claude Projects)

  • Anthropic’s “prompt generator” optimizes prompts via the Anthropic Console Dashboard (click “Generate a Prompt”).

  • Agents

Prompt(ing): 2023–2025

  • In 2023–24, a great deal of belief in prompt engineering as a skill
  • In 2025 there are still useful concepts and ideas 📍 Week 2
  • But not many tricks.
  • For a highly relevant response: provide any important details or context.
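One way to operationalize "provide important details or context" is a structured prompt template. The section labels below are my own illustrative convention, not an official format:

```python
def build_prompt(role, task, context, output_format):
    """Assemble a structured prompt; the section labels are illustrative, not a standard."""
    return (
        f"You are {role}.\n"
        f"Task: {task}\n"
        f"Context: {context}\n"
        f"Output format: {output_format}\n"
    )

# Hypothetical example for a data-analysis task:
p = build_prompt(
    role="a data analyst familiar with survey data",
    task="summarize trends in trust in institutions",
    context="World Values Survey, waves 5-7, country-level averages",
    output_format="five bullet points, plain language",
)
print(p)
```

The point is less the exact labels than the habit: state who the assistant should be, what to do, what data or background applies, and what the output should look like.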

Interactive Workspaces for LLM Collaboration (1/2)

Major AI Platforms

| Workspace | Key Features |
|-------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| Anthropic Claude Artifacts | Dedicated output window · Supports text, code, flowcharts, SVG, websites · Real-time refinement and modification · Sharing and remixing capabilities |
| ChatGPT Canvas | Separate collaboration window · Text editing and coding capabilities · Options for edits, length adjustment · Code review and porting features |
| OpenAI Advanced Data Analysis | Data upload and analysis · Visualization capabilities · Python code execution in the back end · Error correction and refinement |

Interactive Workspaces for LLM Collaboration (2/2)

Specialized Tools

| Workspace | Key Features |
|-----------------------------|------------------------------------------------------------------------------------------------------------------------|
| Claude Analysis Tool | Fast exploratory data analysis · Interactive visualizations with real-time adjustments |
| Google NotebookLM | Document upload for research grounding · Citation and quote provision · “Deep dive conversation” podcast generation |
| Microsoft Copilot | Assistance in Word, Excel, etc. · Data analysis, formula construction |
| Google Gemini for Workspace | Integration with Google’s office suite · Assistance in Docs, etc. |
| Cursor AI Code Editor | AI-assisted coding · Code suggestions and queries, optimization, debugging · Real-time collaboration |

Stochastic Parrot

Image created in detailed photorealistic style by Ralph Losey with ChatGPT4 Visual Muse version

Stochastic Parrots

  • Stochastic = when prompted repeatedly, LLMs may give different answers

  • Parrot = LLMs can repeat information without understanding

  • Philosophy = to what extent do they understand the state of the world?

  • List of words often used by LLMs

Data Analysis

  • To what extent does running the same thing yield the same result? 📍 this class
  • How good are predictions? 📍 this class
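The "stochastic" part can be demonstrated directly: LLMs sample the next token from a probability distribution, and a temperature parameter controls how deterministic the choice is. A minimal sketch, with made-up scores for three candidate tokens:

```python
import math
import random

def sample_next(logits, temperature, rng):
    """Sample a token index from softmax(logits / temperature)."""
    if temperature == 0:  # temperature 0: deterministic, always the top token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    probs = [e / sum(exps) for e in exps]
    return rng.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
rng = random.Random(42)

greedy = [sample_next(logits, 0, rng) for _ in range(5)]      # identical every run
sampled = {sample_next(logits, 1.5, rng) for _ in range(50)}  # several tokens appear
print(greedy, sampled)
```

This is why repeated prompts can give different answers, and why setting temperature to 0 (where the API allows it) improves reproducibility in data-analysis pipelines.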

Big debate on errors and hallucination

The issue is important in medicine

Medical research

Hallucination 2025

  • You can now ask for more thinking (reasoning), which leads to less hallucination

ChatGPT convo on tokens

  • Use explicit push for calculations

ChatGPT convo on tokens 2

LLM vs human 2025

  • LLMs are also trained on scientific papers and books
  • New methods to improve accuracy
  • Solve scientific problems
  • Reasoning models, like OpenAI o1 (o3 to come)

AI use cases

AI Use Cases: Student response

  1. Coding Assistance & Debugging
    • “It helps me with fixing errors in coding.”, “Find small errors I can’t on my own.”
    • “I used it to generate data in Excel to then work with it in Python and R.”
    • “I used it for Python projects .. instead of Google to get answers fast.”
  2. Concept Clarification & Learning Support
    • “To understand certain topics”, “for clarifications on macroeconomics and data analysis concepts.”
    • “For Micro and Macro courses, to understand graphs easily.”
    • “I uploaded the material and explained what I wanted to do in detail and asked it to create me a study guide”
  3. Writing & Proofreading
    • “I use AI for text and code touch-ups for smoother language.”
    • “While writing papers, I use ChatGPT as proofreader”, “improving the coherence.”
    • “I usually give an idea and AI makes it perfect.”

AI Use Cases: Predictions

  1. Literature Review & Summarization
    AI helps quickly find relevant papers, summarize key arguments, and extract citations, saving time in reviewing large bodies of work.

  2. Data Analysis & Coding Assistance
    AI supports coding in R, Python, and Stata, assisting with debugging, automating repetitive tasks, and suggesting statistical methods for empirical research.

  3. Writing & Editing Support
    AI aids in drafting, structuring, and refining academic writing, improving clarity, grammar, and coherence while maintaining academic integrity.

How I use it?

  • All the time, ChatGPT 4o (Canvas), Claude (Projects), ChatGPT o1 (rare), both paid tiers

  • Github Copilot in VSCode and Rstudio

  • This presentation is massively helped by AI

    • How can I make a presentation as HTML? –> Quarto and revealjs
    • add boxes etc., create the yml and custom.css, add to my website
    • Content on RLHF, tokens
    • Summary slides on cyborg vs centaur
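For reference, a minimal Quarto reveal.js header of the kind described above; the file name and options are illustrative, not the actual configuration of these slides:

```yaml
---
title: "Data Analysis with AI: Concepts"
author: "Gábor Békés"
format:
  revealjs:
    theme: default
    css: custom.css      # the custom styling mentioned above (illustrative name)
    slide-number: true
---
```

Rendering `quarto render slides.qmd` then produces self-contained HTML slides.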

How I use it?

  • Ask AI for ideas
    • Write first draft fully alone – it helps me think through
    • use AI to improve it sentence by sentence (to avoid bland BS)
    • use AI to shorten
  • Do white board / notebook thinking
    • Get AI to OCR it and turn it into text
  • Write code
    • from scratch
    • code review
    • debug

What were bad experiences with AI?

Topics

  • Background work
  • Coding
  • Discussion of topics, results

My bad experience

  • AI-written text typically has
    • Good grammar
    • Convincing structure
    • Bland and unoriginal content
  • One paragraph or one page is hard to tell apart from a human’s
  • 10 pages, 10 papers – easy to see

Ethics and Law

Ethics

AI was created by using (stealing?) human knowledge

Is it Okay to use “Everything” as training material?

AI in research

Use of Artificial Intelligence in AER

Note

Artificial intelligence software, such as chatbots or other large language models, may not be listed as an author. If artificial intelligence software was used in the preparation of the manuscript, including drafting or editing text, this must be briefly described during the submission process, which will help us understand how authors are or are not using chatbots or other forms of artificial intelligence. Authors are solely accountable for, and must thoroughly fact-check, outputs created with the help of artificial intelligence software.

AI in research: Elsevier

Two key points from Elsevier policy generative AI policies for journals

  • report for transparency
  • supervise, take responsibility

Use of Artificial Intelligence in classes

You gotta stay a learning human

Conclusions and discussion

Many LLMs, constant evolution

To learn more

  • Look at beyond, where I collect blog posts, videos, books, and papers.

Gabor’s current take I

Should study

  • You have to learn stuff even if AI can also do it.
    • Good writing
    • core coding
  • Be a well rounded educated human
  • Because to supervise AI you need to know what to look for

Use of AI – need to report?

  • My view in 2024. Report what you have done
  • My view in 2025. No need to report, AI is now like internet search (or electricity)

Gabor’s current take II

Your place with AI

  • AI as input, supervision, debugging, responsibility.

  • Without core knowledge you can’t interact

  • Strong knowledge and experience helps debugging

Future: more opportunities

  • Cheaper data analysis = more use cases

Status

  • This is version 0.4.0
  • created after teaching the class
  • redesigned for 6 × 100–120-minute sessions – you can adapt class work and assignments to fit the time
  • assumes audience familiar with basics of coding and data analysis

bekesg@ceu.edu