Employee Commits Dataset v2.1 - Summary
Version: 2.1
Date: 2025-11-12
Iterations: 3 (v1.0 -> v2.0 -> v2.1)
Model: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
What Changed in v2.1
Increased extreme values from 10 to 50:
- 30 very high commits (range: 200-450/month), up from 3
- 10 zero commits with long tenure, up from 3
- 5 high commits for juniors, up from 2
- 5 high commits for infrastructure, up from 2
Data errors remain at 8 (unchanged from v2.0)
Total problematic cases: 58 (8 errors + 50 extreme values)
Clean observations: 542
Total: 600 employees
Files Provided
1. employee_commits_claude.csv
Full dataset with all metadata (600 rows × 8 columns)
Columns:
- i: Employee ID
- x: Days with company
- y: Monthly commits
- department: IT or Analytics
- seniority: Junior, Mid, Senior
- role: Specific job role
- data_issue: Description of any quality issues
- issue_category: Clean, Data Error, or Extreme Value
Use this for teaching data quality and cleaning decisions.
2. employee_commits_raw.csv
Minimal dataset (600 rows × 5 columns)
Columns:
- i: Employee ID
- x: Days with company
- y: Monthly commits
- department: IT or Analytics
- role: Specific job role
Dropped columns: seniority, data_issue, issue_category
Use this as the “raw data” students receive before they know about issues.
3. employee_commits_clean.csv
Only clean observations (542 rows)
All 8 data errors and 50 extreme values removed. Use for comparison or to skip the cleaning exercise.
4. data_quality_report.csv
List of all 58 problematic cases
Shows each issue with employee ID, x, y, and explanation. Use as an answer key or for discussion.
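Because data_quality_report.csv shares the employee ID column i with the raw file, it can be joined back on as an answer key. A minimal sketch, with tiny inline frames standing in for the real CSVs (column names are assumed from the descriptions above):

```python
import pandas as pd

# Stand-ins for employee_commits_raw.csv and data_quality_report.csv;
# the toy report carries only the ID and explanation to avoid column clashes
raw = pd.DataFrame({
    "i": [1, 2, 3],
    "x": [400, -120, 2100],   # days with company
    "y": [12, 30, 6800],      # monthly commits
})
report = pd.DataFrame({
    "i": [2, 3],
    "explanation": ["Negative tenure", "Impossibly high commits"],
})

# Left-join: rows absent from the report are treated as clean
labeled = raw.merge(report, on="i", how="left")
labeled["flagged"] = labeled["explanation"].notna()
print(labeled[["i", "flagged"]])
```

The same join works against the full files once the real column names are confirmed.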
5. employee_commits_v2_plots.png
Diagnostic visualizations
Four panels highlighting the data quality issues:
- Full view with errors (X) and extreme values (★)
- Zoomed view focusing on the reasonable range
- Distribution of monthly commits
- Boxplots by role
Data Quality Issues Breakdown
Data Errors (8 cases) - Must Fix
| Issue Type | Count | Range | Story |
|---|---|---|---|
| Negative commits | 2 | -15, -8 | Data pipeline bug |
| Impossibly high commits | 3 | 6,800 - 12,000 | Decimal/aggregation error |
| Negative tenure | 1 | -120 days | Date calculation error |
| Impossibly long tenure | 2 | 19,500 - 22,000 days | Date entry error (1925 vs 2025) |
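All four error types above are detectable with simple range checks. A sketch of a hard-error filter; the thresholds are assumptions chosen from the ranges in the table (errors sit far outside them), not values from the generator:

```python
import pandas as pd

def flag_hard_errors(df, max_commits=1000, max_tenure_days=18250):
    """Return a boolean Series marking rows that are clearly data errors.

    Illustrative cutoffs: 1,000 commits/month sits between the extreme
    values (max ~450) and the error range (6,800+); ~50 years of tenure
    sits between the clean max (~5,446 days) and the error range (19,500+).
    """
    return (
        (df["y"] < 0)                   # negative commits
        | (df["y"] > max_commits)       # impossibly high commits
        | (df["x"] < 0)                 # negative tenure
        | (df["x"] > max_tenure_days)   # impossibly long tenure
    )

example = pd.DataFrame({"x": [400, -120, 21000, 900],
                        "y": [12, 30, 40, 6800]})
print(flag_hard_errors(example).tolist())  # [False, True, True, True]
```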
Extreme Values (50 cases) - Require Judgment
| Issue Type | Count | Range | Story | Keep or Drop? |
|---|---|---|---|---|
| Very high commits | 30 | 200 - 450/month | Bot-assisted or granular style | Analyst choice |
| Zero commits, long tenure | 10 | x > 2,900 days | Moved to management | Analyst choice |
| High commits for junior | 5 | 154 - 248/month | Internal transfers | Analyst choice |
| High for infrastructure | 5 | 75 - 108/month | IaC enthusiasts | Analyst choice |
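Unlike hard errors, these cases need rules that combine fields, and the cutoffs themselves are analyst choices. A sketch covering three of the four patterns above (cutoffs taken from the table's ranges; the seniority column exists only in the full file):

```python
import pandas as pd

def flag_extremes(df):
    """Flag plausible-but-unusual rows for analyst review (illustrative cutoffs)."""
    very_high = df["y"] >= 200                            # bot-assisted style?
    zero_long = (df["y"] == 0) & (df["x"] > 2900)         # moved to management?
    junior_high = (df.get("seniority") == "Junior") & (df["y"] >= 150)
    return very_high | zero_long | junior_high

example = pd.DataFrame({
    "x": [3000, 3000, 500],
    "y": [0, 250, 20],
    "seniority": ["Senior", "Mid", "Junior"],
})
print(flag_extremes(example).tolist())  # [True, True, False]
```

Keeping the flags separate from any drop decision lets students defend each choice independently.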
Key Statistics
Full Dataset (including issues)
- x: Mean = 1,496 days, Min = -120, Max = 22,000
- y: Mean = 87, Min = -15, Max = 12,000
Errors Removed Only (592 obs)
- x: Mean = 1,434 days (3.9 years), Range = 3 to 5,446
- y: Median = 18, Mean = 42, Range = 0 to 445
Errors and Extreme Values Removed (542 obs)
- y: Median = 15, Mean = 27, Range = 0 to 150 (approximately)
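The shift in these summary numbers under different cleaning choices can be reproduced mechanically by applying the same summary to each filtered subset. A sketch with toy numbers, not the real dataset:

```python
import pandas as pd

def summarize(y):
    """Summary statistics for the commits column under one cleaning choice."""
    return {"n": len(y), "median": y.median(), "mean": round(y.mean(), 1)}

y = pd.Series([0, 5, 12, 18, 25, 40, 300, 6800])  # toy commits column
print(summarize(y))             # all rows
print(summarize(y[y < 1000]))   # errors dropped
print(summarize(y[y < 200]))    # errors and extremes dropped
```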
Teaching Questions
Level 1: Data Quality
- Identify which observations are clearly errors vs possibly valid
- What’s your decision rule for each type of extreme value?
- Document the impact of including vs excluding extreme values
Level 2: Exploration
- Visualize the relationship between x and y
- How does it differ by department? By role?
- Is the relationship linear? Log-linear? Something else?
Level 3: Modeling
- Compare multiple approaches:
  - Linear regression (OLS)
  - Log transformations
  - Poisson regression
  - Negative binomial regression
  - Robust regression (M-estimators)
  - Quantile regression (median)
- Which model assumptions are violated and why?
- How sensitive are results to data cleaning decisions?
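As a minimal illustration of why model choice matters for count data, the sketch below fits OLS and a Poisson GLM to the same simulated counts. The GLM is fit by hand-written iteratively reweighted least squares (IRLS) so only numpy is needed; in practice statsmodels or similar would be used, and the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 3000, 300)              # days with company
y = rng.poisson(np.exp(1.0 + 0.0004 * x))  # counts with a log-linear mean

X = np.column_stack([np.ones_like(x), x])

# OLS on the raw counts (assumes constant variance -- violated for counts)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Poisson GLM via IRLS, started from an intercept-only fit
beta = np.array([np.log(y.mean() + 1.0), 0.0])
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                           # working response
    beta = np.linalg.solve(X.T @ (X * mu[:, None]),   # weighted normal equations
                           X.T @ (mu * z))

print("OLS slope:", beta_ols[1])     # commits per day, linear scale
print("Poisson slope:", beta[1])     # log scale; true value is 0.0004
```

The two slopes are not comparable directly (linear vs log scale), which is itself a useful discussion point.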
Level 4: Communication
- What’s your conclusion about the x-y relationship?
- How certain are you? What are the caveats?
- What additional data would help?
- How would you report this to a non-technical stakeholder?
Suggested Workflow for Students
Stage 1: Initial Exploration
- Load employee_commits_raw.csv
- Create basic visualizations
- Calculate summary statistics
- Notice anything unusual?

Stage 2: Data Quality Assessment
- Identify outliers and unusual values
- Categorize them as errors vs extreme values
- Make cleaning decisions with justification

Stage 3: Multiple Analyses
- Try 3-4 different cleaning approaches
- For each: visualize, model, interpret
- Compare results across approaches

Stage 4: Synthesis
- Which approach do you recommend, and why?
- What is the relationship between x and y?
- How confident are you in this conclusion?
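Stage 3 can be organized as a loop over cleaning rules, so every approach runs through the same pipeline. A sketch with toy data and illustrative rules (the injected values mimic the error and extreme ranges described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 3000, 200)})
df["y"] = rng.poisson(5 + 0.01 * df["x"])   # true slope 0.01 commits/day
df.loc[:2, "y"] = [6800, -15, 400]          # inject two hard errors, one extreme

# Each cleaning approach is a rule mapping the raw frame to an analysis frame
rules = {
    "no cleaning":       lambda d: d,
    "drop errors":       lambda d: d[(d["y"] >= 0) & (d["y"] < 1000)],
    "drop extremes too": lambda d: d[(d["y"] >= 0) & (d["y"] < 200)],
}

results = {}
for name, rule in rules.items():
    sub = rule(df)
    slope = np.polyfit(sub["x"], sub["y"], 1)[0]   # OLS slope of y on x
    results[name] = (len(sub), slope)
    print(f"{name:18s} n={len(sub):3d} slope={slope:.4f}")
```

Comparing the slope column across rows makes the sensitivity of conclusions to cleaning decisions concrete.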
Instructor Notes
Why This Dataset Works
- Realistic complexity: Not obviously clean or dirty
- Ambiguity: No single “correct” answer for extreme values
- Multiple valid approaches: Forces critical thinking
- Count data features: Overdispersion, zeros, skewness
- Rich metadata: Can explore heterogeneity
Discussion Points
- Error vs extreme: What makes something an error vs just unusual?
- Decision rules: How do you decide what to keep/drop?
- Transparency: How do you document cleaning decisions?
- Sensitivity: When do conclusions change materially?
- Model choice: Why prefer one model over another for count data?
Extensions
- Add time dimension (multiple months per employee)
- Include team/project identifiers for clustering
- Add more covariates (education, experience, etc.)
- Create a prediction task (predict commits for new employees)
Technical Details
Seed: 20251112
Software: Python 3.x with numpy, pandas, matplotlib
Also available: R version (simulate_employee_commits_v2.R)
Key assumptions:
- Base monthly rates by role (7-38 commits/month)
- Seniority multipliers (0.6x, 1.2x, 0.8x)
- Individual heterogeneity (log-normal, σ = 0.45)
- Negative binomial distribution (size = 5)
- 5% baseline zeros (moved teams, admin roles, etc.)
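The assumptions above translate into a short simulation. A sketch of the core draw; parameter names and default values are illustrative stand-ins, not the actual generator (see generate_data_v2.1.py for that):

```python
import numpy as np

rng = np.random.default_rng(20251112)   # seed from Technical Details

def simulate_commits(n, base_rate=20.0, seniority_mult=1.0, sigma=0.45,
                     size=5, p_zero=0.05):
    """Draw monthly commit counts: negative binomial around a role/seniority
    mean, with log-normal individual heterogeneity and baseline zeros."""
    mu = base_rate * seniority_mult * rng.lognormal(mean=0.0, sigma=sigma, size=n)
    # Negative binomial parameterized by size k and p = k / (k + mu)
    y = rng.negative_binomial(size, size / (size + mu))
    y[rng.random(n) < p_zero] = 0       # moved teams, admin roles, etc.
    return y

y = simulate_commits(600)
print(y.mean(), (y == 0).mean())
```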
For questions or modifications, see generate_data_v2.1.py