xyfinz

Data analysis as x,y,f,i,n,z

Gábor Békés (CEU)

2025-05-01

‘xyfinz’: A framework for data analysis projects

about me

  • Economics prof by day, data analyst by (other) day
  • Central European University (Vienna, AT) + Research job in Budapest
  • Research on organizations w data from football, OSS, history
  • Occasional consulting: “adult supervision”

Linkedin: linkedin.com/in/bekesgabor/ BlueSky: gaborbekes.bsky.social

about me

  • Economics prof by day, data analyst by (other) day
  • Central European University (Vienna, AT) + Research job in Budapest
  • Research on organizations w Github data
  • How teams formed, and their success
  • Modelling OSS development process
  • Data Analysis Textbook

Linkedin: linkedin.com/in/bekesgabor/ BlueSky: gaborbekes.bsky.social
Data Analysis Textbook: gabors-data-analysis.com, repo: github.com/gabors-data-analysis/

(1.-2.) The x and y

The x and y

Understanding Relationships

  • Data analysis is mostly about understanding a relationship between two variables
  • We call them x and y

The x and y - Real World Examples (1)

Example 1

How much more money do engineers with more experience make?

x = years of experience
y = monthly salary

Example 2

Do developers using AI have higher productivity?

x = use AI or not
y = PRs merged per week

The x and y - Real World Examples (2)

Example 3

Is blood pressure affected by diet?

x = fruits+veggies consumed
y = blood pressure

Example 4

How does mileage affect car value?

x = odometer reading
y = price

Patterns, prediction and causality

  • Data does not tell a story
  • You do

  • Data Analysis must serve a purpose of understanding
  • Data doesn’t tell stories on its own; you create the narrative through analysis.

Three Ways to Use x and y

Patterns: x ↔︎ y

Uncover associations between x and y

Prediction: x → y

Use x to predict y

Causality: x ⇒ y

Test if x causes y

The difference between x and y

Y = outcome, target (dependent variable)

X = predictor, causal variable (independent variable)

Key Points

  • Data itself is neutral
  • You decide what is x and y
  • Driven by what you want to understand


The x and y: association

  • Do people with a higher number of commits per month make more money?
  • Do consultancy professionals using AI have higher productivity?
  • Do people eating more fruits and veggies have lower blood pressure?

The x and y: association – setup

Association Analysis

  • Understanding patterns, comparing conditional means

  • Compare average y for different values of x (see the sketch after this list)

  • First step, works for any dataset

  • Reveals relationships without making strong claims

  • Helps identify potential variables of interest

  • Generates hypotheses for further investigation
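
A minimal pandas sketch of such a comparison of conditional means – average PRs merged per week (y) by AI use (x); the column names and numbers are made up for illustration:

    import pandas as pd

    # Toy data: x = uses AI (yes/no), y = PRs merged per week (illustrative numbers)
    df = pd.DataFrame({
        "uses_ai":      ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
        "prs_per_week": [6, 4, 7, 3, 5, 4, 8, 5],
    })

    # Conditional means: average y for each value of x
    print(df.groupby("uses_ai")["prs_per_week"].mean())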

The x and y: prediction

  • Could we use information on commit behavior to forecast salaries?
  • Could we predict the number of tasks finished, knowing whether AI was used?
  • Is a diet with more fruits and veggies a good predictor of blood pressure?

The x and y: prediction – setup

Prediction = Modelling for the future

  • Take the data we have to build model(s) = training data

  • Training data: x and y known

  • Use the model on new data in the future = live data

  • Live data: x known, y predicted (see the sketch below)
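
A minimal sketch of the train/live split, using scikit-learn's LinearRegression; the experience-to-salary numbers are made up for illustration:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Training data: both x and y are known (illustrative numbers)
    x_train = np.array([[1], [2], [3], [5], [8], [10]])       # years of experience
    y_train = np.array([3000, 3400, 3900, 4800, 6000, 6800])  # monthly salary

    model = LinearRegression().fit(x_train, y_train)

    # Live data: x is known, y is what the model predicts
    x_live = np.array([[4], [7]])
    print(model.predict(x_live))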

The x and y: causality

  • Does having more commits per month lead to getting a higher salary?

  • Does allowing workplace access to AI have an effect on the number of tasks professionals finish?

  • Does changing diet to more fruits and veggies cause a bp decline?

  • Lead. Make. Effect. Affect. Cause. == causal words – use with care.

The x and y: causality – setup

Causal inference

  • Understanding the "what if" question
  • Causality = there is room for an intervention (action) by an agent

Actions

  • Commit more frequently / require a minimum amount of change per commit
  • Allow/prohibit access to ChatGPT / GH Copilot
  • Change diet

Prediction vs causality

Approaches vary…

  • Prediction – what to expect given observed values…
  • Causality – what to expect if we intervened

… but intertwined

  • For causality, you still want to predict y conditional on your action
  • For prediction, understanding underlying causal patterns → stable models

Many x

There could be many X variables

Prediction

  • More predictors → better model

Causality

  • Account for variation across groups

(3.) f()

y~f(x) is the relationship between y and x

Thinking about the nature of relationship

  • Data analysis compares mean y conditional on observable x:
    • Average salary by years of experience
    • Tasks completed by AI usage
  • f(x) defines how x relates to y – we can model this relationship in different ways (see the sketch after the examples below)

The relationship: y~f(x): Examples

f(x): Binary Relationship

f(x): Categorical Relationship

f(x): Linear Relationship

f(x): Quadratic Relationship
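
A sketch of two of these functional forms – a linear and a quadratic f(x) fitted to the same simulated data with numpy.polyfit (all numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(42)

    # Simulated data with a mildly curved true relationship
    x = rng.uniform(0, 10, 200)
    y = 2 + 1.5 * x - 0.08 * x**2 + rng.normal(0, 0.5, 200)

    # f(x) linear:    y ≈ b0 + b1*x
    linear = np.polyfit(x, y, deg=1)

    # f(x) quadratic: y ≈ b0 + b1*x + b2*x^2
    quadratic = np.polyfit(x, y, deg=2)

    print("linear coefficients:   ", linear)
    print("quadratic coefficients:", quadratic)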

When does f(x) matter?

  • Prediction: high – precise f(x) helps make better predictions (mo’ money)
  • Patterns of association: medium – yes but focus on core relationship
  • Causality: low – simplicity valued

(4.) i

What is an observation i

i = an observation

Data is often generated as a transaction, an interaction, or a response

  • A commit pushed to a repo, i = a commit
  • (by account, repo, timestamp)
  • Survey answers to a questionnaire, i = a respondent

Task

  • Understand and decide what we can do

What is an observation i – analyst decision

  • The analyst might aggregate
  • How to aggregate is the analyst's decision (~ research question, legal, technical) – see the sketch below
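
A pandas sketch of that aggregation decision: raw data where i = a commit (account, repo, timestamp), aggregated so that i = a repo-month; the records are made up for illustration:

    import pandas as pd

    # Raw data: i = a commit, identified by account, repo, timestamp (illustrative)
    commits = pd.DataFrame({
        "account":   ["ann", "bob", "ann", "cid", "bob"],
        "repo":      ["alpha", "alpha", "beta", "alpha", "beta"],
        "timestamp": pd.to_datetime(
            ["2024-01-03", "2024-01-15", "2024-02-02", "2024-02-20", "2024-02-25"]
        ),
    })

    # Analyst decision: aggregate so that i = a repo-month
    commits["month"] = commits["timestamp"].dt.to_period("M")
    repo_month = (
        commits.groupby(["repo", "month"])
               .agg(n_commits=("account", "size"), n_devs=("account", "nunique"))
               .reset_index()
    )
    print(repo_month)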

(5.) N

N is the size of the dataset

  • How many observations we have for the analysis.

  • Size determines what we can expect from analysis

N is the size of the dataset

Dataset size matters greatly for prediction

  • Large data – easier to build predictive models
  • We have a better chance of finding patterns

Dataset size is also helpful – it gives us more certainty

  • Causal questions
  • Experiments

N is the size of the dataset – who decides

Dataset size given

  • All repos on GitHub at this second
  • All customs declarations entering the LA seaport in 2024

Dataset size decided

  • Scope of a developer survey, N of respondents

In between: sample design

  • How many years to include
  • Exclude repos with fewer than 5 commits, or repos and devs of large orgs (see the sketch below)
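
A minimal sketch of such a sample-design filter; the data frame, columns, and thresholds are made up for illustration:

    import pandas as pd

    # Illustrative repo-level data
    repos = pd.DataFrame({
        "repo":      ["alpha", "beta", "gamma", "delta"],
        "n_commits": [120, 3, 48, 2],
        "org_size":  [12, 4, 900, 7],
    })

    # Sample design: drop repos with fewer than 5 commits and repos of very large orgs
    sample = repos[(repos["n_commits"] >= 5) & (repos["org_size"] < 500)]
    print(sample)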

(6.) Z

Z is what confounds the causal analysis

  • People who eat more veggies (x) have lower blood pressure (y)
  • Causal claim: let us make people eat more veggies to get lower bp

But:

  • People decide on many things = self-selection into action
  • They may run. People who run also eat more veggies and have lower bp
  • They may eat fewer fries. Fat is bad for arteries, raising bp
  • They may go to a doctor who gives them a pill to lower bp

Z is what confounds the causal analysis

  • People who eat more veggies (x) have lower blood pressure (y)
  • Causal claim: let us make people eat more veggies to get lower bp

Confounders (z) are variables that prevent such a claim

  • In observational data: confounders are often present (see the simulated sketch below)
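
A simulated sketch of how a confounder works, assuming z = running regularly (all numbers made up): z raises veggie intake (x) and independently lowers blood pressure (y), so a regression of y on x alone overstates the effect of x; adding z as a control (here via statsmodels OLS) moves the estimate back toward the true value.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 5000

    # Confounder z: whether a person runs regularly
    z = rng.binomial(1, 0.4, n)

    # Runners eat more veggies (x) and, separately, have lower blood pressure (y)
    x = 2 + 1.5 * z + rng.normal(0, 1, n)               # veggie portions per day
    y = 130 - 1.0 * x - 6.0 * z + rng.normal(0, 5, n)   # systolic blood pressure

    # Naive regression of y on x only: picks up part of the effect of z
    naive = sm.OLS(y, sm.add_constant(x)).fit()

    # Controlling for the confounder z
    controlled = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()

    print("x coefficient, no control:  ", naive.params[1])
    print("x coefficient, z controlled:", controlled.params[1])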

x,y,f,i,N,z – not independent

Framework Components Interact

  • Each element influences the others
  • Changes to one affect analysis strategy

x ↔︎ y ↔︎ f ↔︎ i ↔︎ N ↔︎ z

Larger N → more complicated f(x)

  • Big data
  • More observations, more variables
  • f(x), f(z) can be more nuanced
  • Machine learning.

i,N → x,y

The type of observations we have determines what analysis we can do

  • Commit level: developer background and behavior

  • Commits in a city: geography of OSS

  • Commits per day: holidays and OSS

  • More detailed observations often allow for better identification of causal effects

xyfinz

  • When designing data analysis projects, the key decision points are understanding what x, y, f, i, N, and z are

I’m around:

Linkedin: linkedin.com/in/bekesgabor/ BlueSky: gaborbekes.bsky.social

Check out

Data Analysis Textbook: gabors-data-analysis.com, repo: github.com/gabors-data-analysis/

A case study: team geography and success: Evidence from Github

A case study: team geography and success: Evidence from Github

(with Miklós Koren, Julian Hinz, Aaron Lohmann)

Are concentrated teams more successful?

  • Multinational enterprises
  • Multi-location teams
  • Cost of communication

Setup, data

Setup

  • Free open source software
  • Developers
  • Git as a distributed version control system, GitHub as a development platform

Cool original data, where we can observe

  • Developers and their activities
  • Developer commit location
  • Repo success (stars, downstream use)

Thanks to Kevin Xu at GitHub, Inc.

xyfinz

  • x = measure of geographic concentration
  • y = success of repos
  • f = (complicated)
  • i = each repo (repo-time)
  • N = millions
  • z = quality, experience of coders, language, topic

Finding

  • Teams with more dispersed location of developers tend to be more successful.
  • A great deal of selection in matching
  • Diverse teams work on more promising ideas, conditional on talent

Check out my textbook and feel free to reach out

I’m around:

Linkedin: linkedin.com/in/bekesgabor/ BlueSky: gaborbekes.bsky.social

Check out

Data Analysis Textbook: gabors-data-analysis.com, repo: github.com/gabors-data-analysis/