Data and code

Publish the data and code or it didn’t happen *

We have created the textbook as a complete package of text, code and data. While the textbook is available for money, code and data are free!

You can start exploration of the ecosystem around Gabors Data Analysis:

Below, we describe how to set up your system, get access to data and code, and replicate graphs and tables in the book.

Basic setup

Let us start with a simple way to store data and code on your computer. To ensure smooth sailing, you will need to create to folders on your computer, anywhere you like.

For the code: da_case_studies

For the data: da_data_repo

This setup is assumed for all the codes in R, Python and Stata.

Getting code

All the code that reproduces all the tables and graphs in the textbook is available freely to use.

Organization:

  1. Each case study has a separate folder.
  2. Within case study folders, codes in different languages are simply stored together.
  3. Some intermediary files (csv, rds) may be saved there, too.
  4. Output folders are created when you run the code

All codes in R, Python, and Stata should work well. We regularly make edits and updates to improve.
Check the Github repo for latest release

Option A: Download in one [advised]

The whole codebase for the textbook may be simply downloaded, currently we have the pre-release version, codename: v0.8.1-pre-release "Sweet and Full of Grace".

Steps

  1. Download it in a zipped file
  2. Unzip and rename da_case_studies

Option B: Fork and clone from Github [advanced]

Visit the Github page github.com/gabors-data-analysis/da_case_studies where the live version of the code is available.

Steps

  1. Sign up to Github
  2. Visit the Github page github.com/gabors-data-analysis/da_case_studies
  3. Fork the da_case_studies repository
  4. Clone to a local drive, name it da_case_studies

Getting data

Data is shared via a OSF project repository.

Option A: download dataset folders [advised]

Steps

  1. Create a da_data_repo folder on your local computer.
  2. Visit the OSF project repository. You will see a list of datasets. You will need to download each dataset folder one by one.
  3. For each dataset, click on the OSF Storage(United States) or OSF Storage(Germany - Frankfurt) icon and download as zip.
  4. Extract from the zip, making sure that the folder name is exactly the same as in the OSF repository
  5. Repeat for all the datasets you need.
  6. Add the dataset folders to a da_data_repo folder to ensure all codes work smoothly.

Option B: Download the whole textbook material

You can download a single ZIP file that contains all datasets, with clean datasets only. This is for size considerations (the raw data are 20GB or so).

To get it just visit the da_data_repo site of our OSF repo, download, unzip and enjoy.

Option C: Directly open from script

At the same time, each dataset is a component and files may be directly opened from code. For example, with the hotel-europe dataset:

R: data1<-read.csv(url("https://osf.io/p6tyr/download"))

Python: pd.read_csv("https://osf.io/p6tyr/download")

Stata: import delimited "https://osf.io/p6tyr/download"

Really, really simple.

Setting up to run code

You will need install libraries and make some minor edits in some code bits. Tasks vary depending on the coding language. This textbook is coding language neutral. Our code is written in all three most widely used tools for data analysis. See our brief summary, so pick one and follow instructions!

How to set up for Stata?
How to set up for R?
How to set up for Python?

Some advice on learning to code

Differences in output

The graphs and results in the textbook come from R. However, most results and graphs should be the same when running from Stata or Python.

There could some differences across output from different languages.

  1. Graphs may vary as some settings vary. We made a great effort to reduce this as much as possible - sometimes adding more paramateres to graph making bits than we would normally do.
  2. Whenever there is any randomization in the background, results will indeed differ (example is cross-validation)
  3. Some minor differences are caused by variation in some defaults in some formula, such as degree of freedom (example is BIC)