Week02: Code and data discovery and documentation with AI

# Objectives ## Summary Sometimes data is large and discovery is hard. Sometimes you need to write data documentation. LLMs can help. You will learn how to write a clear and professional README. We use a cleaned subset of the 7th Wave of the World Values Survey (WVS). We'll also talk some tech on documentation. ## Learning Objectives: * Understand how to document a new dataset using as an example th WVS 7th wave data. * Create a README that describes data. * Learn to refine documentation by incorporating iterative feedback from peers and AI tools. ## Preparation BEFORE class ### Reading and review * Background reading: Békés-Kézdi (2021) Chapters 1-3, in particular [core background info](/week02/assets/da-background.md) * Some discussion of data types [Data Management in Large-Scale Education Research](https://datamgmtinedresearch.com/structure) by Crystal Lewis ### Get data and info: Access the [VWS dataset](/data/VWS) 1. Data: [WVS_random_subset.csv](/data/VWS/WVS_random_subset2000.csv) - random subset (N=2000) - covering all countries 2. Download its official [codebook documentation](/data/VWS/codebook.pdf) If you prefer datasets are also at [OSF, Gabors Data Analysis / World Values Survey](https://osf.io/mfd6s/) # Class plan ## Review Assignment 01 * Follow instructions. * How to get close to original, different ways * Why do an app? What to expect from an app * streamlit * shinyapps ## I. Background ### About Markdown * Editor in R, Python [Quarto](https://quarto.org/) * Online [Markdown editor](https://jbt.github.io/markdown-editor/) * Also: [Pandoc](https://pandoc.org/) ### What is a good readme? **Some examples for reproduction package** * Békés-Kézdi (2021) [Hotels dataset](https://gabors-data-analysis.com/datasets/hotels-europe/) -- show basics * Koren-Pető (2021) [Business disruptions from social distancing](https://zenodo.org/records/4016325/preview/README.md?include_deleted=0) as [PDF](https://zenodo.org/records/4016325/files/README.pdf?download=1) * Some ideas on readme: [Makereadme](https://www.makeareadme.com/), [Social Science Editors](https://social-science-data-editors.github.io/template_README/) **Key ingredients** * Overview of project * license * All datasets (data tables) separately discussed * All key variables described (name, content, type, coverage (% share missing) * maybe also: source, extension (csv / xlsx/ parquet) ### What is a variable dictionary (also called codebook) * more details of a dataset, often as xlsx * metric (euro, %), meaning of values if categorical * maybe even mean, min, max **Examples** * Békés-Kézdi (2021) [Bisnode dataset variables](https://osf.io/9a3t4) * Reif (2022) [illinois-wellness-data](https://github.com/reifjulian/illinois-wellness-data/blob/master/data/codebooks/firm_admin.codebook.txt) ## II. Work on data ### No AI * Download and look at the Random Subset data * Start collecting some info on the data without AI * Start thinking about an interesting research question (find $y$ and $x$) ### AI: let AI teach you also about * Start asking for skeleton readme, ask about advice * Discussion ### AI: Learning and idea generation * Tell AI about your plan and need for a readme * experiment with one-shot vs interaction * Discussion ### Cyborg mode: create a readme with AI * Upload the codebook + random subset data * Get AI to design a README TEMPLATE for this task. * Get a draft * Understand and edit draft ### III additional idea * Sometimes, complicated projects have extensive folder structure. Use A to design a folder structure ## End of Week Discussion points * What was the biggest contribution of AI? * First result vs after iterations -- what did improve? * How do you feel about learning from AI vs human instructor? Pros and cons? # Assignment See suggested [assignment for week 02](assignment/assignment_02.md)