Employee Commits Dataset v2.1 - Summary
Version: 2.1
Date: 2025-11-12
Iterations: 3 (v1.0 -> v2.0 -> v2.1)
Model: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
What Changed in v2.1
Increased extreme values from 10 to 50:
- 30 very high commits (range: 200-450/month), up from 3
- 10 zero commits with long tenure, up from 3
- 5 high commits for juniors, up from 2
- 5 high commits for infrastructure, up from 2
Data errors remain at 8 (unchanged from v2.0)
Total problematic cases: 58 (8 errors + 50 extreme values)
Clean observations: 542
Total: 600 employees
Files Provided
1. employee_commits_claude.csv
Full dataset with all metadata (600 rows × 8 columns)
Columns:
- i: Employee ID
- x: Days with company
- y: Monthly commits
- department: IT or Analytics
- seniority: Junior, Mid, Senior
- role: Specific job role
- data_issue: Description of any quality issues
- issue_category: Clean, Data Error, or Extreme Value
Use this for teaching data quality and cleaning decisions.
2. employee_commits_raw.csv
Minimal dataset (600 rows × 5 columns)
Columns:
- i: Employee ID
- x: Days with company
- y: Monthly commits
- department: IT or Analytics
- role: Specific job role
Dropped columns: seniority, data_issue, issue_category
Use this as the “raw data” students receive before they know about issues.
3. employee_commits_clean.csv
Only clean observations (542 rows)
All 8 data errors and 50 extreme values removed. Use for comparison or to skip the cleaning exercise.
4. data_quality_report.csv
List of all 58 problematic cases
Shows each issue with employee ID, x, y, and explanation. Use as an answer key or for discussion.
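Because data_quality_report.csv shares the employee ID column i with the raw file, it can be joined back on as an answer key. A minimal sketch, with tiny inline frames standing in for the real CSVs (column names are assumed from the descriptions above):

```python
import pandas as pd

# Stand-ins for employee_commits_raw.csv and data_quality_report.csv;
# the toy report carries only the ID and explanation to avoid column clashes
raw = pd.DataFrame({
    "i": [1, 2, 3],
    "x": [400, -120, 2100],   # days with company
    "y": [12, 30, 6800],      # monthly commits
})
report = pd.DataFrame({
    "i": [2, 3],
    "explanation": ["Negative tenure", "Impossibly high commits"],
})

# Left-join: rows absent from the report are treated as clean
labeled = raw.merge(report, on="i", how="left")
labeled["flagged"] = labeled["explanation"].notna()
print(labeled[["i", "flagged"]])
```

The same join works against the full files once the real column names are confirmed.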
5. employee_commits_v2_plots.png
Diagnostic visualizations
Four panels highlighting the data quality issues:
- Full view with errors (X) and extreme values (★)
- Zoomed view focusing on the reasonable range
- Distribution of monthly commits
- Boxplots by role
Data Quality Issues Breakdown
Data Errors (8 cases) - Must Fix
| Issue Type | Count | Range | Story |
|---|---|---|---|
| Negative commits | 2 | -15, -8 | Data pipeline bug |
| Impossibly high commits | 3 | 6,800 - 12,000 | Decimal/aggregation error |
| Negative tenure | 1 | -120 days | Date calculation error |
| Impossibly long tenure | 2 | 19,500 - 22,000 days | Date entry error (1925 vs 2025) |
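All four error types above are detectable with simple range checks. A sketch of a hard-error filter; the thresholds are assumptions chosen from the ranges in the table (errors sit far outside them), not values from the generator:

```python
import pandas as pd

def flag_hard_errors(df, max_commits=1000, max_tenure_days=18250):
    """Return a boolean Series marking rows that are clearly data errors.

    Illustrative cutoffs: 1,000 commits/month sits between the extreme
    values (max ~450) and the error range (6,800+); ~50 years of tenure
    sits between the clean max (~5,446 days) and the error range (19,500+).
    """
    return (
        (df["y"] < 0)                   # negative commits
        | (df["y"] > max_commits)       # impossibly high commits
        | (df["x"] < 0)                 # negative tenure
        | (df["x"] > max_tenure_days)   # impossibly long tenure
    )

example = pd.DataFrame({"x": [400, -120, 21000, 900],
                        "y": [12, 30, 40, 6800]})
print(flag_hard_errors(example).tolist())  # [False, True, True, True]
```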
Extreme Values (50 cases) - Require Judgment
| Issue Type | Count | Range | Story | Keep or Drop? |
|---|---|---|---|---|
| Very high commits | 30 | 200 - 450/month | Bot-assisted or granular style | Analyst choice |
| Zero commits, long tenure | 10 | x > 2,900 days | Moved to management | Analyst choice |
| High commits for junior | 5 | 154 - 248/month | Internal transfers | Analyst choice |
| High for infrastructure | 5 | 75 - 108/month | IaC enthusiasts | Analyst choice |
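Unlike hard errors, these cases need rules that combine fields, and the cutoffs themselves are analyst choices. A sketch covering three of the four patterns above (cutoffs taken from the table's ranges; the seniority column exists only in the full file):

```python
import pandas as pd

def flag_extremes(df):
    """Flag plausible-but-unusual rows for analyst review (illustrative cutoffs)."""
    very_high = df["y"] >= 200                            # bot-assisted style?
    zero_long = (df["y"] == 0) & (df["x"] > 2900)         # moved to management?
    junior_high = (df.get("seniority") == "Junior") & (df["y"] >= 150)
    return very_high | zero_long | junior_high

example = pd.DataFrame({
    "x": [3000, 3000, 500],
    "y": [0, 250, 20],
    "seniority": ["Senior", "Mid", "Junior"],
})
print(flag_extremes(example).tolist())  # [True, True, False]
```

Keeping the flags separate from any drop decision lets students defend each choice independently.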
Key Statistics
Full Dataset (including issues)
- x: Mean = 1,496 days, Min = -120, Max = 22,000
- y: Mean = 87, Min = -15, Max = 12,000
Errors Removed Only (592 obs)
- x: Mean = 1,434 days (3.9 years), Range = 3 to 5,446
- y: Median = 18, Mean = 42, Range = 0 to 445
Errors and Extreme Values Removed (542 obs)
- y: Median = 15, Mean = 27, Range = 0 to 150 (approximately)
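The shift in these summary numbers under different cleaning choices can be reproduced mechanically by applying the same summary to each filtered subset. A sketch with toy numbers, not the real dataset:

```python
import pandas as pd

def summarize(y):
    """Summary statistics for the commits column under one cleaning choice."""
    return {"n": len(y), "median": y.median(), "mean": round(y.mean(), 1)}

y = pd.Series([0, 5, 12, 18, 25, 40, 300, 6800])  # toy commits column
print(summarize(y))             # all rows
print(summarize(y[y < 1000]))   # errors dropped
print(summarize(y[y < 200]))    # errors and extremes dropped
```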
Teaching Questions
Level 1: Data Quality
- Identify which observations are clearly errors vs possibly valid
- What’s your decision rule for each type of extreme value?
- Document the impact of including vs excluding extreme values
Level 2: Exploration
- Visualize the relationship between x and y
- How does it differ by department? By role?
- Is the relationship linear? Log-linear? Something else?
Level 3: Modeling
- Compare multiple approaches:
  - Linear regression (OLS)
  - Log transformations
  - Poisson regression
  - Negative binomial regression
  - Robust regression (M-estimators)
  - Quantile regression (median)
- Which model assumptions are violated and why?
- How sensitive are results to data cleaning decisions?
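As a minimal illustration of why model choice matters for count data, the sketch below fits OLS and a Poisson GLM to the same simulated counts. The GLM is fit by hand-written iteratively reweighted least squares (IRLS) so only numpy is needed; in practice statsmodels or similar would be used, and the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 3000, 300)              # days with company
y = rng.poisson(np.exp(1.0 + 0.0004 * x))  # counts with a log-linear mean

X = np.column_stack([np.ones_like(x), x])

# OLS on the raw counts (assumes constant variance -- violated for counts)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Poisson GLM via IRLS, started from an intercept-only fit
beta = np.array([np.log(y.mean() + 1.0), 0.0])
for _ in range(25):
    eta = X @ beta
    mu = np.exp(eta)
    z = eta + (y - mu) / mu                           # working response
    beta = np.linalg.solve(X.T @ (X * mu[:, None]),   # weighted normal equations
                           X.T @ (mu * z))

print("OLS slope:", beta_ols[1])     # commits per day, linear scale
print("Poisson slope:", beta[1])     # log scale; true value is 0.0004
```

The two slopes are not comparable directly (linear vs log scale), which is itself a useful discussion point.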
Level 4: Communication
- What’s your conclusion about the x-y relationship?
- How certain are you? What are the caveats?
- What additional data would help?
- How would you report this to a non-technical stakeholder?
Suggested Workflow for Students
Stage 1: Initial Exploration
- Load employee_commits_raw.csv
- Create basic visualizations
- Calculate summary statistics
- Notice anything unusual?

Stage 2: Data Quality Assessment
- Identify outliers and unusual values
- Categorize them as errors vs extreme values
- Make cleaning decisions with justification

Stage 3: Multiple Analyses
- Try 3-4 different cleaning approaches
- For each: visualize, model, interpret
- Compare results across approaches

Stage 4: Synthesis
- Which approach do you recommend, and why?
- What is the relationship between x and y?
- How confident are you in this conclusion?
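Stage 3 can be organized as a loop over cleaning rules, so every approach runs through the same pipeline. A sketch with toy data and illustrative rules (the injected values mimic the error and extreme ranges described above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 3000, 200)})
df["y"] = rng.poisson(5 + 0.01 * df["x"])   # true slope 0.01 commits/day
df.loc[:2, "y"] = [6800, -15, 400]          # inject two hard errors, one extreme

# Each cleaning approach is a rule mapping the raw frame to an analysis frame
rules = {
    "no cleaning":       lambda d: d,
    "drop errors":       lambda d: d[(d["y"] >= 0) & (d["y"] < 1000)],
    "drop extremes too": lambda d: d[(d["y"] >= 0) & (d["y"] < 200)],
}

results = {}
for name, rule in rules.items():
    sub = rule(df)
    slope = np.polyfit(sub["x"], sub["y"], 1)[0]   # OLS slope of y on x
    results[name] = (len(sub), slope)
    print(f"{name:18s} n={len(sub):3d} slope={slope:.4f}")
```

Comparing the slope column across rows makes the sensitivity of conclusions to cleaning decisions concrete.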
Instructor Notes
Why This Dataset Works
- Realistic complexity: Not obviously clean or dirty
- Ambiguity: No single “correct” answer for extreme values
- Multiple valid approaches: Forces critical thinking
- Count data features: Overdispersion, zeros, skewness
- Rich metadata: Can explore heterogeneity
Discussion Points
- Error vs extreme: What makes something an error vs just unusual?
- Decision rules: How do you decide what to keep/drop?
- Transparency: How do you document cleaning decisions?
- Sensitivity: When do conclusions change materially?
- Model choice: Why prefer one model over another for count data?
Extensions
- Add time dimension (multiple months per employee)
- Include team/project identifiers for clustering
- Add more covariates (education, experience, etc.)
- Create a prediction task (predict commits for new employees)
Technical Details
Seed: 20251112
Software: Python 3.x with numpy, pandas, matplotlib
Also available: R version (simulate_employee_commits_v2.R)
Key assumptions:
- Base monthly rates by role (7-38 commits/month)
- Seniority multipliers (0.6x, 1.2x, 0.8x)
- Individual heterogeneity (log-normal, σ = 0.45)
- Negative binomial distribution (size = 5)
- 5% baseline zeros (moved teams, admin roles, etc.)
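The assumptions above translate into a short simulation. A sketch of the core draw; parameter names and default values are illustrative stand-ins, not the actual generator (see generate_data_v2.1.py for that):

```python
import numpy as np

rng = np.random.default_rng(20251112)   # seed from Technical Details

def simulate_commits(n, base_rate=20.0, seniority_mult=1.0, sigma=0.45,
                     size=5, p_zero=0.05):
    """Draw monthly commit counts: negative binomial around a role/seniority
    mean, with log-normal individual heterogeneity and baseline zeros."""
    mu = base_rate * seniority_mult * rng.lognormal(mean=0.0, sigma=sigma, size=n)
    # Negative binomial parameterized by size k and p = k / (k + mu)
    y = rng.negative_binomial(size, size / (size + mu))
    y[rng.random(n) < p_zero] = 0       # moved teams, admin roles, etc.
    return y

y = simulate_commits(600)
print(y.mean(), (y == 0).mean())
```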
For questions or modifications, see generate_data_v2.1.py