Data analysis as x,y,f,i,n,z
2025-05-01
LinkedIn: linkedin.com/in/bekesgabor/ BlueSky: gaborbekes.bsky.social
Data Analysis Textbook: gabors-data-analysis.com, repo: github.com/gabors-data-analysis/
Understanding Relationships
How much more money do engineers with more experience make?
x = years of experience
y = monthly salary
Do developers using AI have higher productivity?
x = use AI or not
y = PRs merged per week
Is blood pressure affected by diet?
x = fruits+veggies consumed
y = blood pressure
How does mileage affect car value?
x = odometer reading
y = price
Patterns: x ↔︎ y
Uncover associations between x and y
Prediction: x → y
Use x to predict y
Causality: x ⇒ y
Test if x causes y
y = outcome, target (dependent variable)
x = predictor, causal variable (independent variable)
Key Points
x → y
Association Analysis
Understanding patterns, comparing conditional means
Compare average y for different values of x
First step, works for any dataset
Reveals relationships without making strong claims
Helps identify potential variables of interest
Generates hypotheses for further investigation
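Comparing conditional means can be sketched in a few lines. This is a minimal illustration, not a recommended workflow: the records are made up for the example, and the helper name `conditional_means` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative records: (x = uses AI or not, y = PRs merged per week)
records = [
    ("uses AI", 6), ("uses AI", 8), ("uses AI", 7),
    ("no AI", 5), ("no AI", 4), ("no AI", 6),
]


def conditional_means(pairs):
    """Return the average y for each distinct value of x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in groups.items()}


means = conditional_means(records)
for x, avg_y in means.items():
    print(f"{x}: average y = {avg_y}")

# The gap between the two averages describes an association only;
# it makes no claim about what causes the difference.
gap = means["uses AI"] - means["no AI"]
print(f"gap in average y = {gap}")
```

The same function works for any dataset that can be written as (x, y) pairs, which is why this comparison is a natural first step.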
Prediction = Modelling for the future
Take the data we have to build model(s) = training data
Training data: x and y known
Apply the model to data that arrives in the future = live data
Live data: x known, y predicted
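The training data / live data split can be sketched with a straight-line model. This is an assumption-laden toy: the numbers are invented, the units (1000 km, 1000 USD) are chosen for the example, and a least-squares line stands in for whatever model one would actually build.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = a + b * x by ordinary least squares and return (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Training data (made up): x and y both known
train_x = [20, 50, 80, 110, 140]   # odometer reading, 1000 km
train_y = [18, 15, 12, 9, 6]       # price, 1000 USD

a, b = fit_line(train_x, train_y)

# Live data (made up): only x is known, y is predicted from the model
live_x = [65, 125]
predicted_y = [a + b * x for x in live_x]

for x, y_hat in zip(live_x, predicted_y):
    print(f"x = {x}: predicted y = {y_hat:.1f}")
```

The model never sees y for the live data; everything it knows about the x to y pattern comes from the training data.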
Does having more commits per month lead to getting a higher salary?
Does allowing workplace access to AI have an effect on the number of tasks professionals finish?
Does changing diet to more fruits and veggies cause a decline in blood pressure?
Lead. Make. Effect. Affect. Cause. == causal words – use with care.
Causal inference
Understanding the "what if" question
Causality = there is room for an intervention (action) by an agent
Actions
Commit more frequently / require a minimum amount of change per commit
Allow / prohibit access to ChatGPT / GitHub Copilot
Change diet
Approaches vary…
… but are intertwined
There could be many X variables
Prediction
Causality
Thinking about the nature of relationship
i = an observation
Data are often generated as a transaction, interaction, or response
Task
N = the number of observations we have for the analysis
Size determines what we can expect from analysis
Dataset size matters greatly for prediction
A larger dataset also helps: it gives us more certainty
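One way to make "more certainty" concrete is the standard error of an average, which shrinks roughly with the square root of N. The sketch below uses simulated, made-up salaries (the mean of 5000 and spread of 1000 are arbitrary assumptions) and is only meant to show the direction of the effect.

```python
import math
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the simulated data are reproducible


def standard_error(sample):
    """Standard error of the sample mean: sample sd divided by sqrt(N)."""
    return stdev(sample) / math.sqrt(len(sample))


se_by_n = {}
for n in (25, 100, 400, 1600):
    # Simulated monthly salaries; the distribution is an assumption
    sample = [random.gauss(5000, 1000) for _ in range(n)]
    se_by_n[n] = standard_error(sample)
    print(f"N = {n:4d}: mean = {mean(sample):7.1f}, standard error = {se_by_n[n]:6.1f}")
```

Quadrupling N roughly halves the standard error, so each extra step in precision needs a lot more data.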
Dataset size given
Dataset size decided
In between: sample design
But:
Confounders (z) are variables that prevent such a claim
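A toy example of how a confounder can produce a gap in y between x groups even when x does nothing. Everything here is invented for illustration: z is taken to be seniority, seniors are assumed to use AI more often and to merge more PRs regardless of AI use, and `gap_in_means` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: (z = seniority, x = uses AI, y = PRs merged per week)
records = (
    [("senior", True, 8)] * 3 + [("senior", False, 8)] * 1
    + [("junior", True, 4)] * 1 + [("junior", False, 4)] * 3
)


def gap_in_means(rows):
    """Average y where x is True minus average y where x is False."""
    with_x = [y for _, x, y in rows if x]
    without_x = [y for _, x, y in rows if not x]
    return mean(with_x) - mean(without_x)


# Naive comparison: ignores z entirely
naive_gap = gap_in_means(records)

# Comparison within each value of z: holds the confounder fixed
by_z = defaultdict(list)
for row in records:
    by_z[row[0]].append(row)
gap_within_z = {z: gap_in_means(rows) for z, rows in by_z.items()}

print(f"naive gap: {naive_gap}")
print(f"gap within each z: {gap_within_z}")
```

In this constructed data the naive gap is positive, yet it disappears once developers are compared only with others of the same seniority. Real data are rarely this clean, and comparing within groups only handles the confounders we can observe.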
Framework Components Interact
x ↔︎ y ↔︎ f ↔︎ i ↔︎ N ↔︎ z
The type of observations we have determines what analysis we can do
Commit level: developer background and behavior
Commits in city: geography of OSS
Commits per day: holidays and OSS
More detailed observations often allow for better identification of causal effects
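The same raw records can be turned into different kinds of observations. A small sketch, assuming commit-level records with a developer, a city and a date; the developer ids, cities and dates are made up for the example.

```python
from collections import Counter

# Made-up commit-level records: i = one commit
commits = [
    {"developer": "dev_a", "city": "Vienna", "date": "2025-04-28"},
    {"developer": "dev_a", "city": "Vienna", "date": "2025-04-29"},
    {"developer": "dev_b", "city": "Budapest", "date": "2025-04-28"},
    {"developer": "dev_c", "city": "Budapest", "date": "2025-04-28"},
    {"developer": "dev_c", "city": "Budapest", "date": "2025-05-01"},
]

# i = a developer: how many commits each developer made
commits_per_developer = Counter(c["developer"] for c in commits)

# i = a city: how many commits came from each city
commits_per_city = Counter(c["city"] for c in commits)

# i = a day: how many commits were made on each day
commits_per_day = Counter(c["date"] for c in commits)

print(commits_per_developer)
print(commits_per_city)
print(commits_per_day)
```

Aggregating is easy, going back is not: once the data are stored per city or per day, the developer-level detail is gone.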
I’m around:
LinkedIn: linkedin.com/in/bekesgabor/ BlueSky: gaborbekes.bsky.social
Check out
Data Analysis Textbook: gabors-data-analysis.com, repo: github.com/gabors-data-analysis/
(with Miklós Koren, Julian Hinz, Aaron Lohmann)
Setup
Cool original data, where we can observe
Thanks to Kevin Xu at GitHub, Inc.
‘xyfinz’ – Data Analysis for All with Gábor - GitHub Edition - 2025-05-01