Week02: Discovery and documentation
Data discovery and data and code documentation with AI
Week02: Discovery and documentation
Data discovery and data and code documentation with AI
Objectives
Summary
Sometimes data is large and discovery is hard. Sometimes you need to write data documentation. LLMs can help. You will learn how to write a clear and professional README. We use a cleaned subset of the 7th Wave of the World Values Survey (WVS). We’ll also talk some tech on documentation.
Learning Objectives:
- Understand how to document a new dataset using as an example th WVS 7th wave data.
- Create a README that describes data.
- Learn to refine documentation by incorporating iterative feedback from peers and AI tools.
Preparation BEFORE class
Reading and review
- Background reading: Békés-Kézdi (2021) Chapters 1-3, in particular core background info
- Some discussion of data types Data Management in Large-Scale Education Research by Crystal Lewis
Get data and info:
Access the VWS dataset 1. Data: WVS_random_subset.csv - random subset (N=2000) - covering all countries 2. Download its official codebook documentation
If you prefer datasets are also at OSF, Gabors Data Analysis / World Values Survey
Class plan
Review Assignment 01
- Follow instructions.
- How to get close to original, different ways
- Why do an app? What to expect from an app
- streamlit
- shinyapps
I. Background
About Markdown
- Editor in R, Python Quarto
- Online Markdown editor
- Also: Pandoc
What is a good readme?
Some examples for reproduction package
Békés-Kézdi (2021) Hotels dataset – show basics
Koren-Pető (2021) Business disruptions from social distancing as PDF
Some ideas on readme: Makereadme, Social Science Editors
Key ingredients
- Overview of project
- license
- All datasets (data tables) separately discussed
- All key variables described (name, content, type, coverage (% share missing)
- maybe also: source, extension (csv / xlsx/ parquet)
What is a variable dictionary (also called codebook)
- more details of a dataset, often as xlsx
- metric (euro, %), meaning of values if categorical
- maybe even mean, min, max
Examples
- Békés-Kézdi (2021) Bisnode dataset variables
- Reif (2022) illinois-wellness-data
II. Work on data
No AI
- Download and look at the Random Subset data
- Start collecting some info on the data without AI
- Start thinking about an interesting research question (find \(y\) and \(x\))
AI: let AI teach you also about
- Start asking for skeleton readme, ask about advice
- Discussion
AI: Learning and idea generation
- Tell AI about your plan and need for a readme
- experiment with one-shot vs interaction
- Discussion
Cyborg mode: create a readme with AI
- Upload the codebook + random subset data
- Get AI to design a README TEMPLATE for this task.
- Get a draft
- Understand and edit draft
III additional idea
- Sometimes, complicated projects have extensive folder structure. Use A to design a folder structure
End of Week Discussion points
- What was the biggest contribution of AI?
- First result vs after iterations – what did improve?
- How do you feel about learning from AI vs human instructor? Pros and cons?
Assignment
See suggested assignment for week 02