Week02: Code and data discovery and documentation with AI
# Objectives
## Summary
Large datasets are hard to explore, and writing data documentation takes effort. LLMs can help with both. You will learn how to write a clear and professional README, using a cleaned subset of the 7th wave of the World Values Survey (WVS). We will also cover some technical aspects of documentation.
## Learning Objectives:
* Understand how to document a new dataset, using the WVS 7th wave data as an example.
* Create a README that describes data.
* Learn to refine documentation by incorporating iterative feedback from peers and AI tools.
## Preparation BEFORE class
### Reading and review
* Background reading: Békés-Kézdi (2021) Chapters 1-3, in particular [core background info](/week02/assets/da-background.md)
* Some discussion of data types: [Data Management in Large-Scale Education Research](https://datamgmtinedresearch.com/structure) by Crystal Lewis
### Get data and info:
Access the [WVS dataset](/data/VWS)
1. Data: [WVS_random_subset.csv](/data/VWS/WVS_random_subset2000.csv) - random subset (N=2000) - covering all countries
2. Download its official [codebook documentation](/data/VWS/codebook.pdf)
If you prefer, the datasets are also available at [OSF, Gabors Data Analysis / World Values Survey](https://osf.io/mfd6s/)
# Class plan
## Review Assignment 01
* Follow instructions.
* How to get close to the original: different approaches
* Why do an app? What to expect from an app
* streamlit
* shinyapps
## I. Background
### About Markdown
* Editor for R and Python: [Quarto](https://quarto.org/)
* Online [Markdown editor](https://jbt.github.io/markdown-editor/)
* Also: [Pandoc](https://pandoc.org/)
### What is a good readme?
**Some examples of reproduction packages**
* Békés-Kézdi (2021) [Hotels dataset](https://gabors-data-analysis.com/datasets/hotels-europe/) -- show basics
* Koren-Pető (2021) [Business disruptions from social distancing](https://zenodo.org/records/4016325/preview/README.md?include_deleted=0) as [PDF](https://zenodo.org/records/4016325/files/README.pdf?download=1)
* Some ideas on READMEs: [Make a README](https://www.makeareadme.com/), [Social Science Data Editors](https://social-science-data-editors.github.io/template_README/)
**Key ingredients**
* Overview of project
* license
* All datasets (data tables) separately discussed
* All key variables described (name, content, type, coverage (% share missing))
* maybe also: source, file format (csv / xlsx / parquet)
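
The variable list in a README can be drafted from the data itself. The sketch below is one possible way to do it with pandas: it builds a table with name, type, and % share missing, and renders it as a Markdown table. The toy data and its column names are made up for illustration; they are not the actual WVS variables.

```python
import pandas as pd


def variable_overview(df: pd.DataFrame) -> pd.DataFrame:
    """One row per variable: name, storage type, and % of missing values."""
    return pd.DataFrame(
        {
            "variable": df.columns,
            "type": [str(t) for t in df.dtypes],
            "pct_missing": (df.isna().mean() * 100).round(1).to_numpy(),
        }
    )


def to_markdown_table(overview: pd.DataFrame) -> str:
    """Render the overview as a plain Markdown table to paste into a README."""
    header = "| " + " | ".join(overview.columns) + " |"
    divider = "|" + "|".join(" --- " for _ in overview.columns) + "|"
    rows = [
        "| " + " | ".join(str(v) for v in row) + " |"
        for row in overview.itertuples(index=False)
    ]
    return "\n".join([header, divider, *rows])


# Toy data with hypothetical column names, standing in for the real file.
toy = pd.DataFrame(
    {
        "country": ["HUN", "AUT", "HUN", None],
        "age": [34, 51, None, 28],
        "trust": [1, 2, 2, 1],
    }
)
overview = variable_overview(toy)
print(to_markdown_table(overview))
```

The "content" column (what each variable means) still has to come from the codebook or from you; the code only fills in what can be computed.
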
### What is a variable dictionary (also called codebook)?
* more detail on each variable of a dataset, often stored as xlsx
* unit of measurement (euro, %), meaning of each value if categorical
* maybe even mean, min, max
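
A first draft of such a dictionary can be generated from the data and then completed by hand. The following is a minimal sketch, assuming pandas; the `labels` mapping, the column names, and the label texts are hypothetical and would normally come from the official codebook.

```python
import pandas as pd


def variable_dictionary(df: pd.DataFrame, labels: dict[str, str]) -> pd.DataFrame:
    """Codebook-style table: label, type, and min/mean/max for numeric columns."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append(
            {
                "variable": name,
                "label": labels.get(name, ""),  # empty if no label was supplied
                "type": str(col.dtype),
                "n_unique": col.nunique(),
                "min": col.min() if numeric else None,
                "mean": round(col.mean(), 2) if numeric else None,
                "max": col.max() if numeric else None,
            }
        )
    return pd.DataFrame(rows)


# Toy data and labels are illustrative assumptions, not WVS content.
toy = pd.DataFrame(
    {
        "country": ["HUN", "AUT", "HUN", "DEU"],
        "age": [34, 51, 28, 45],
    }
)
labels = {
    "country": "Country code (hypothetical label)",
    "age": "Age of respondent in years (hypothetical label)",
}
codebook = variable_dictionary(toy, labels)
print(codebook.to_string(index=False))
```

For categorical variables, the meaning of each value cannot be computed and has to be added from the documentation.
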
**Examples**
* Békés-Kézdi (2021) [Bisnode dataset variables](https://osf.io/9a3t4)
* Reif (2022) [illinois-wellness-data](https://github.com/reifjulian/illinois-wellness-data/blob/master/data/codebooks/firm_admin.codebook.txt)
## II. Work on data
### No AI
* Download and look at the Random Subset data
* Start collecting some info on the data without AI
* Start thinking about an interesting research question (find $y$ and $x$)
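
A first look without AI can be as simple as loading the file and counting rows, columns, distinct values, and missing entries. The sketch below assumes pandas and uses a tiny in-memory CSV with made-up column names so that it runs on its own; in class you would pass the path of the downloaded subset to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV file; columns and values are invented.
raw = io.StringIO(
    "country,age,life_satisfaction\n"
    "HUN,34,7\n"
    "AUT,51,8\n"
    "HUN,28,\n"
    "DEU,45,6\n"
)
df = pd.read_csv(raw)

n_rows, n_cols = df.shape
# Few distinct values often signals a categorical variable; many, a continuous one.
first_look = pd.DataFrame(
    {"n_unique": df.nunique(), "n_missing": df.isna().sum()}
)
print(f"{n_rows} rows, {n_cols} columns")
print(first_look)
```
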
### AI: let AI teach you about READMEs
* Start by asking for a skeleton README, then ask for advice
* Discussion
### AI: Learning and idea generation
* Tell AI about your plan and need for a readme
* Experiment with one-shot prompting vs. an interactive conversation
* Discussion
### Cyborg mode: create a readme with AI
* Upload the codebook + random subset data
* Get AI to design a README TEMPLATE for this task.
* Get a draft
* Understand and edit draft
## III. Additional idea
* Sometimes, complicated projects have an extensive folder structure. Use AI to design a folder structure
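
Once a folder structure is agreed on, it can be created and listed with a few lines of code. This is a minimal sketch using Python's standard library; the layout and the project name `wvs-project` are hypothetical examples, not a recommended standard. It runs inside a temporary directory so nothing is left on disk.

```python
import tempfile
from pathlib import Path

# Hypothetical layout; folder names are illustrative only.
LAYOUT = [
    "data/raw",
    "data/clean",
    "code",
    "docs",
    "output/figures",
    "output/tables",
]


def create_layout(root: Path, folders: list[str]) -> list[Path]:
    """Create each folder under root (no error if it exists) and return the paths."""
    created = []
    for rel in folders:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def tree(root: Path) -> str:
    """Return an indented listing of all subfolders, for pasting into a README."""
    lines = [root.name + "/"]
    for path in sorted(p for p in root.rglob("*") if p.is_dir()):
        depth = len(path.relative_to(root).parts)
        lines.append("  " * depth + path.name + "/")
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "wvs-project"
    created = create_layout(project, LAYOUT)
    listing = tree(project)
    print(listing)
```

The printed listing can go straight into the README so that readers see where raw data, cleaned data, code, and outputs live.
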
## End of Week Discussion points
* What was the biggest contribution of AI?
* First result vs after iterations -- what improved?
* How do you feel about learning from AI vs a human instructor? Pros and cons?
# Assignment
See suggested [assignment for week 02](assignment/assignment_02.md)