Employee Commits Dataset v3.0 - Complete Summary
Version: 3.0 (RECOMMENDED VERSION)
Date: 2025-11-12
Iterations: 4 (v1.0 → v2.0 → v2.1 → v3.0)
Model: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929)
What’s New in v3.0 - KEY IMPROVEMENTS
1. ✅ Rising Productivity in First 2 Years (Ramp-Up)
- Early career (0-2yr): Mean = 9.7 commits/month
- Productivity rises from ~30% to 100% of the full rate over the first 730 days
- Pattern: People are learning, ramping up skills
2. ✅ Smooth Tenure Distribution (No Cliff!)
- Exponential distribution: Most people have low tenure
- Natural decline: 148 employees at <1yr, 112 at 1-2yr, 167 across the three-year 2-5yr band (about 56 per year), 115 at 5+yr
- Pattern: Realistic employee turnover and retention
3. ✅ Marked Decline After 7-8 Years
- Late career (7+yr): Mean = 8.7 commits/month (lower than early career!)
- Zero commits: 19.8% of employees with 7+ years (vs 5% baseline)
- Pattern: Move to management, mentoring, architecture roles
The X-Y Relationship: Now MUCH Clearer!
Overall Statistics (Clean Data, N=542)
Correlations:
- Pearson (linear): r = 0.019 (still weak in the linear sense)
- Spearman (rank): ρ = 0.190 (p < 0.0001, significant!)
Key Insight: The relationship is non-linear and non-monotonic!
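The gap between the two coefficients can be reproduced with a short sketch. It uses `scipy.stats.pearsonr` and `scipy.stats.spearmanr` on a synthetic, symmetric inverted U (not the real dataset), so the printed values will differ from the figures above; the helper name `correlation_summary` and the demo arrays are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def correlation_summary(x, y):
    """Return Pearson r and Spearman rho, each with its p-value."""
    pearson_r, pearson_p = stats.pearsonr(x, y)
    spearman_rho, spearman_p = stats.spearmanr(x, y)
    return {
        "pearson_r": pearson_r,
        "pearson_p": pearson_p,
        "spearman_rho": spearman_rho,
        "spearman_p": spearman_p,
    }

# Synthetic stand-in (assumption): a symmetric inverted U plus noise.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, size=500)
y_demo = 25 - (x_demo - 5) ** 2 + rng.normal(0, 2, size=500)

demo = correlation_summary(x_demo, y_demo)
print(demo)
```

On the real data, the same function could be applied to the `x` and `y` columns of the clean file. A near-zero Pearson r alongside a visible curve in a scatter plot is the signal that a straight line is the wrong summary.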
The Career Progression Curve
| Career Stage | Tenure | Mean | Median | N | Pattern |
|---|---|---|---|---|---|
| Early (Learning) | 0-1 year | 7.4 | 5.5 | 148 | Low (ramping up) |
| Growing (Maturing) | 1-2 years | 13.0 | 10.5 | 112 | Rising (learning curve) |
| Peak (Full productivity) | 2-5 years | 32.0 | 24.0 | 167 | Maximum (IC contributors) |
| Late (Leadership) | 5+ years | 10.3 | 7.0 | 115 | Declining (management) |
The peak-stage mean (32.0) is about 4.3 times the early-career mean (7.4)!
The late-career mean (10.3) drops back toward early-career levels!
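A table like the one above can be rebuilt with `pandas.cut` and a grouped aggregation. This is a minimal sketch that assumes `x` is tenure in days, `y` is monthly commits, and that the stage boundaries fall at 365, 730 and 1825 days (365 days per year); the exact cut points used to produce the table are not stated here, so treat them as an assumption. The toy DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed stage boundaries in days (365 days per year).
STAGE_EDGES_DAYS = [0, 365, 730, 1825, np.inf]
STAGE_LABELS = ["0-1 year", "1-2 years", "2-5 years", "5+ years"]

def summarize_by_stage(df):
    """Mean, median and count of y within each tenure stage."""
    stage = pd.cut(df["x"], bins=STAGE_EDGES_DAYS, labels=STAGE_LABELS, right=False)
    return (
        df.assign(stage=stage)
        .groupby("stage", observed=False)["y"]
        .agg(mean="mean", median="median", n="size")
    )

# Hypothetical toy rows, only to show the output shape.
toy = pd.DataFrame({"x": [100, 200, 500, 1000, 1500, 3000],
                    "y": [4, 6, 12, 30, 20, 8]})
stage_summary = summarize_by_stage(toy)
print(stage_summary)
```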
Statistical Significance
ANOVA for Seniority Effect:
- F-statistic: 93.66
- p-value: < 0.0001
- Result: Highly significant differences across career stages
The evidence is overwhelming that career stage matters!
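A one-way ANOVA of this kind can be sketched with `scipy.stats.f_oneway`. The helper `anova_by_group` and the three simulated groups below are illustrative assumptions, so the F-statistic printed here will not match the 93.66 reported above.

```python
import numpy as np
from scipy import stats

def anova_by_group(values, groups):
    """One-way ANOVA of values across the levels found in groups."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    samples = [values[groups == g] for g in np.unique(groups)]
    f_stat, p_value = stats.f_oneway(*samples)
    return f_stat, p_value

# Simulated groups (assumption), loosely echoing low / high / low stage means.
rng = np.random.default_rng(1)
group_labels = np.repeat(["early", "peak", "late"], 50)
commit_values = np.concatenate([
    rng.normal(7, 3, 50),
    rng.normal(32, 10, 50),
    rng.normal(10, 4, 50),
])

f_stat, p_value = anova_by_group(commit_values, group_labels)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

With the full dataset, the `seniority` column would be a natural choice for `groups`. Classical ANOVA assumes roughly equal variances across groups, which count data with a wide peak-stage spread may violate, so a rank-based test is a reasonable cross-check.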
Visualizations Show Clear Patterns
Panel 1: Tenure Distribution (Top Left)
- Smooth exponential decline (no cliff at 2000 days)
- Most employees concentrated in 0-2000 day range
- Natural thinning as tenure increases
Panel 2: Career Progression Curve (Top Right)
RED MEDIAN TREND LINE shows:
1. Rise from 0 to ~1000 days (roughly 0-3 years)
2. Peak around 1000-2000 days (roughly 3-5 years)
3. Decline after 2500 days (roughly 7+ years)
4. Low plateau beyond 3500 days (roughly 10+ years)
This is the inverted-U pattern that students should discover!
Panel 3: Productivity by Career Stage (Bottom Left)
Boxplots dramatically show:
- Early: Median ~5, tight distribution
- Peak: Median ~24, wide distribution (high variance)
- Late: Median ~7, back down to early levels
Panel 4: Zero Commits Rate (Bottom Right)
The “management transition” effect:
- Stable ~5-15% for first 2500 days
- Spike to 30-50% after 2500 days (7 years)
- Shows clear transition from coding to leadership
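The zero-commit rate shown in this panel can be approximated by grouping on fixed-width tenure bins. The sketch below assumes `x` is tenure in days and `y` is monthly commits; the 500-day bin width and the toy rows are assumptions, not values taken from the plotting script.

```python
import pandas as pd

def zero_rate_by_tenure(df, bin_width_days=500):
    """Share of rows with y == 0 inside each tenure bin (indexed by bin start)."""
    bin_start = (df["x"] // bin_width_days) * bin_width_days
    return (df["y"] == 0).groupby(bin_start).mean()

# Hypothetical toy rows: one zero among four early rows, two zeros late.
toy_zeros = pd.DataFrame({"x": [100, 200, 300, 400, 2600, 2700],
                          "y": [5, 3, 0, 8, 0, 0]})
zero_rates = zero_rate_by_tenure(toy_zeros)
print(zero_rates)
```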
Why This is the BEST Teaching Dataset
1. Realistic Career Arc
This pattern matches real-world phenomena:
- Academic research output (inverted-U with age)
- Sports performance (peak then decline)
- Software engineering (senior devs code less)
- Management transitions (IC → people leadership)
2. Multiple Valid Analysis Approaches
Students can explore:
Linear approaches:
- Simple OLS: R² ≈ 0 (fails completely)
- With group dummies: R² ≈ 0.25 (much better!)
Non-linear approaches:
- Polynomial regression (quadratic or cubic)
- Splines (natural or smoothing splines)
- GAM (generalized additive models)
- Locally weighted regression (LOESS)
Count data approaches:
- Poisson regression
- Negative binomial (handles overdispersion)
- Zero-inflated models (for late-career zeros)
Group-based approaches:
- ANOVA / linear model with tenure groups
- Separate models by career stage
- Mixed effects with random slopes
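To make the contrast between these approaches concrete, the following sketch compares R² for a straight line, a quadratic, and tenure-group means (equivalent to OLS on group dummies). It runs on a synthetic inverted U with tenure measured in years, not on the real dataset, so the numbers will not match the R² values quoted above; the helper names and the bin edges at 1, 2 and 5 years are assumptions.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def polynomial_r2(x, y, degree):
    """Least-squares polynomial fit of the given degree; returns in-sample R^2."""
    coeffs = np.polyfit(x, y, deg=degree)
    return r_squared(y, np.polyval(coeffs, x))

def group_means_r2(x, y, edges):
    """Predict each observation by the mean of its tenure bin (group dummies)."""
    bins = np.digitize(x, edges)
    y_hat = np.zeros(len(y), dtype=float)
    for b in np.unique(bins):
        mask = bins == b
        y_hat[mask] = y[mask].mean()
    return r_squared(y, y_hat)

# Synthetic inverted U (assumption): tenure in years, peak at 5 years.
rng = np.random.default_rng(0)
x_years = rng.uniform(0, 10, size=500)
y_sim = 25 - (x_years - 5) ** 2 + rng.normal(0, 3, size=500)

r2_linear = polynomial_r2(x_years, y_sim, degree=1)
r2_quadratic = polynomial_r2(x_years, y_sim, degree=2)
r2_groups = group_means_r2(x_years, y_sim, edges=[1, 2, 5])
print(f"linear={r2_linear:.3f} quadratic={r2_quadratic:.3f} groups={r2_groups:.3f}")
```

These are in-sample R² values, which always favour the more flexible model. For a fair comparison in class, a held-out split or an information criterion would be the safer yardstick.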
3. Forces Critical Thinking
Questions students must grapple with:
- Why does simple linear regression fail?
- How do we model non-monotonic relationships?
- When to use continuous vs categorical predictors?
- How to interpret “no linear relationship” vs “no relationship”?
4. Data Quality Challenges
Still includes 58 problematic cases (8 errors + 50 extreme values):
- Students must clean data first
- Different cleaning decisions affect results
- Teaches importance of documentation
Key Statistics Summary
Tenure Distribution (Clean Data)
- Mean: 1,162 days (3.2 years)
- Median: 788 days (2.2 years)
- Range: 1 to 5,475 days (0 to 15 years)
- Most employees: 0-2000 days (70%)
Commits Distribution (Clean Data)
- Mean: 16.8 commits/month
- Median: 10.0 commits/month (right-skewed!)
- Range: 0 to 119 commits/month
- Zero commits: 50 cases (9.2%)
By Department (Clean Data)
- IT: Mean = 18.0, Median = 12.0 (n=307)
- Analytics: Mean = 15.1, Median = 9.0 (n=235)
- Difference: Not statistically significant (p=0.09)
Data Quality Issues
- 8 errors (must fix): negative values, impossibly high/long
- 50 extreme values (analyst judgment): very high commits, zeros with long tenure
- 542 clean observations (90.3%)
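A first-pass error screen for the "must fix" category might look like the sketch below. It assumes `x` is tenure in days and `y` is monthly commits. The 5,475-day ceiling comes from the tenure range stated in this document, while `MAX_PLAUSIBLE_COMMITS` is a hypothetical cutoff chosen only for illustration; the rules actually used to build data_quality_report_v3.csv may differ.

```python
import pandas as pd

MAX_TENURE_DAYS = 5475        # upper end of the documented tenure range
MAX_PLAUSIBLE_COMMITS = 500   # hypothetical cutoff, not taken from the dataset

def flag_errors(df):
    """Return one boolean column per rule plus an any_error summary column."""
    flags = pd.DataFrame({
        "negative_commits": df["y"] < 0,
        "tenure_out_of_range": (df["x"] < 1) | (df["x"] > MAX_TENURE_DAYS),
        "implausible_commits": df["y"] > MAX_PLAUSIBLE_COMMITS,
    })
    flags["any_error"] = flags.any(axis=1)
    return flags

# Hypothetical rows: one clean record followed by four different problems.
toy_raw = pd.DataFrame({"x": [100, 300, 9000, 400, 0],
                        "y": [10, -3, 8, 9999, 5]})
error_flags = flag_errors(toy_raw)
print(error_flags)
```

Keeping one column per rule, rather than a single pass/fail flag, makes it easier for students to document which cleaning decision removed which row.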
Files Provided (v3.0)
Main Data Files
- employee_commits_claude_v3.csv (600 rows)
  - Full dataset with metadata
  - Columns: i, x, y, department, seniority, role, data_issue, issue_category
  - Use for: Teaching with answer key
- employee_commits_raw_v3.csv (600 rows)
  - Minimal dataset (i, x, y, department, role only)
  - Use for: Give to students first (discovery learning)
- employee_commits_clean_v3.csv (542 rows)
  - Only clean observations
  - Use for: Analysis after cleaning, or to skip the cleaning exercise
- data_quality_report_v3.csv (58 rows)
  - List of all problematic cases
  - Use for: Answer key for data cleaning
Visualization Files
- employee_commits_v3_plots.png
  - Shows DGP features: ramp-up, peak, decline, zeros
  - 4 panels demonstrating career progression
- xy_relationship_analysis_v3.png
  - Comprehensive 8-panel analysis
  - Shows relationship from multiple angles
  - Includes residual diagnostics
Code Files
- generate_data_v3.py
  - Complete data generation code
  - Fully commented and reproducible
  - Change seed or parameters as needed
- analyze_xy_relationship_v3.py
  - Relationship analysis script
  - Creates 8-panel visualization
  - Statistical tests included
Teaching Workflow (Recommended)
Stage 1: Discovery (Raw Data)
- Give students employee_commits_raw_v3.csv
- Ask: “What is the relationship between x and y?”
- Let them explore, visualize, analyze
- They should discover the non-linear pattern!
Stage 2: Data Quality
- Ask: “Are there any data quality issues?”
- Students identify errors and extreme values
- Make and document cleaning decisions
- Compare with data_quality_report_v3.csv
Stage 3: Multiple Models
- Try various modeling approaches
- Compare: OLS, polynomials, splines, groups, etc.
- Evaluate which model best captures the pattern
- Discuss trade-offs (interpretability vs fit)
Stage 4: Interpretation
- What is the relationship between tenure and commits?
- Why is it non-monotonic?
- What does this tell us about career progression?
- How would you communicate findings to stakeholders?
Key Teaching Points
1. Linear Models Can Miss Important Patterns
- r ≈ 0 doesn’t mean “no relationship”
- It means “no linear relationship”
- Always visualize first!
2. Non-Monotonic Relationships Are Common
- Not everything increases or decreases monotonically
- Career arcs, learning curves, life cycles all have peaks
- Need appropriate modeling approaches
3. Group-Based Analysis Often Better
- Categorical predictors (seniority) explain more than a single linear term in tenure (days)
- Domain knowledge helps create meaningful groups
- Sometimes simpler is better (interpretability)
4. Count Data Has Special Properties
- Discrete, non-negative, right-skewed
- Poisson/NB often better than OLS
- Zero-inflation is a real phenomenon
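One quick way to see why Poisson regression can fall short is the variance-to-mean ratio, which is about 1 for Poisson counts and larger under overdispersion. The sketch below simulates both cases using the overall mean of 16.8 and the negative binomial size of 5 quoted elsewhere in this document; the simulation itself and the helper name `dispersion_ratio` are assumptions, not part of the provided scripts.

```python
import numpy as np

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(42)
mu, k = 16.8, 5  # mean and NB size; implied NB variance is mu + mu**2 / k

poisson_counts = rng.poisson(lam=mu, size=5000)
# NumPy parametrises the negative binomial by (n, p); p = k / (k + mu) gives mean mu.
nb_counts = rng.negative_binomial(n=k, p=k / (k + mu), size=5000)

poisson_ratio = dispersion_ratio(poisson_counts)
nb_ratio = dispersion_ratio(nb_counts)
print(f"Poisson ratio = {poisson_ratio:.2f}, NB ratio = {nb_ratio:.2f}")
```

For a negative binomial with these parameters, the theoretical ratio is 1 + mu/k, so a value well above 1 on the real `y` column would point toward negative binomial or zero-inflated models.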
5. Context Matters
- The inverted-U makes perfect sense for careers
- Early: learning and ramping up
- Peak: full IC productivity
- Late: transitioning to leadership
- Domain knowledge guides modeling choices
Comparison: v2.1 vs v3.0
| Feature | v2.1 | v3.0 | Impact |
|---|---|---|---|
| Early career | Flat/low | Rising | More realistic learning curve |
| Tenure distribution | Hard cutoffs | Smooth exponential | No artificial cliff |
| Late career | Moderate decline | Marked decline + zeros | Clear management transition |
| X-Y correlation | -0.005 | 0.019 | Slightly positive (better) |
| Spearman correlation | 0.120 | 0.190 | Stronger rank-based signal |
| Seniority F-stat | 35.5 | 93.7 | Much stronger group effects |
| Pedagogical clarity | Good | Excellent | Pattern unmistakable |
v3.0 is the recommended version for teaching!
Extensions and Variations
Easy Modifications (change in generate_data_v3.py):
- Different ramp-up speed: Change x/730 factor
- Earlier/later peak: Adjust 2500 day threshold
- Steeper decline: Increase 0.12 decline factor
- More zeros: Increase 0.07 probability increment
- Different roles: Modify base_monthly_rates
- More departments: Add to dept_split
Advanced Extensions:
- Time series: Multiple observations per employee
- Team effects: Add team ID with random effects
- Project complexity: Add covariate affecting commits
- Turnover: Some employees leave (censoring)
- Promotions: Explicit seniority changes over time
Technical Details
DGP Specifications (v3.0)
Tenure Distribution:
    days_with_company ~ Exponential(mean=1200)
    Truncated: [1, 5475] days
Productivity Function:
    ramp_factor = {
        0.3 + 0.7*(x/730)              if x < 730
        1.0                            if 730 ≤ x < 2500
        max(0.25, 1.0 - 0.12*years)    if x ≥ 2500
    }
    lambda = base_rate × seniority_mult × ramp_factor × individual_effect
    y ~ NegativeBinomial(mu=lambda, size=5)
Zero Inflation:
    P(y=0) = {
        0.05                           if x < 2500
        min(0.35, 0.05 + 0.07*years)   if x ≥ 2500
    }
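The specification above can be sketched in NumPy as follows. Two things are assumptions rather than facts from generate_data_v3.py: `years` is read as years elapsed beyond the 2500-day threshold (which keeps both functions continuous at x = 2500), and the product base_rate × seniority_mult × individual_effect is collapsed into a single hypothetical `base_lambda` argument. Treat this as an illustration of the formulas, not as the generator itself.

```python
import numpy as np

DECLINE_START_DAYS = 2500

def _years_past_threshold(x):
    # Assumption: "years" counts years elapsed after the 2500-day threshold.
    return np.maximum(x - DECLINE_START_DAYS, 0) / 365.0

def ramp_factor(x):
    """Piecewise productivity multiplier: ramp-up, plateau, then floored decline."""
    x = np.asarray(x, dtype=float)
    ramp_up = 0.3 + 0.7 * (x / 730.0)
    decline = np.maximum(0.25, 1.0 - 0.12 * _years_past_threshold(x))
    return np.where(x < 730, ramp_up,
                    np.where(x < DECLINE_START_DAYS, 1.0, decline))

def zero_probability(x):
    """Probability of a structural zero: 5% baseline, rising to a 35% cap."""
    x = np.asarray(x, dtype=float)
    late = np.minimum(0.35, 0.05 + 0.07 * _years_past_threshold(x))
    return np.where(x < DECLINE_START_DAYS, 0.05, late)

def simulate_commits(x, base_lambda, size=5, rng=None):
    """Draw zero-inflated negative binomial counts for tenure values x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    mu = base_lambda * ramp_factor(x)  # base_lambda is a hypothetical stand-in
    # NumPy uses (n, p); p = size / (size + mu) gives mean mu.
    counts = rng.negative_binomial(n=size, p=size / (size + mu))
    is_zero = rng.random(x.shape) < zero_probability(x)
    return np.where(is_zero, 0, counts)

tenure_days = np.array([30, 365, 1000, 3000, 5475])
sim_counts = simulate_commits(tenure_days, base_lambda=20.0,
                              rng=np.random.default_rng(20251112))
print(ramp_factor(tenure_days), zero_probability(tenure_days), sim_counts)
```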
Random Seed
np.random.seed(20251112) - fully reproducible
Software Requirements
- Python 3.x
- numpy, pandas, matplotlib, scipy, scikit-learn (imported as sklearn)
Bottom Line
v3.0 creates a realistic dataset that:
✅ Shows clear non-monotonic career progression
✅ Challenges students’ intuitions about linearity
✅ Requires sophisticated thinking about modeling
✅ Reflects real-world career dynamics
✅ Provides rich opportunities for exploration
✅ Has no single “right” answer (by design)
✅ Teaches both technical skills and critical thinking
Perfect for teaching “What is the relationship between x and y?”
Quick Start
- Give students: employee_commits_raw_v3.csv
- Ask: “What is the relationship between tenure (x) and monthly commits (y)?”
- Let them explore and struggle (this is where learning happens!)
- Reveal patterns using: employee_commits_v3_plots.png and xy_relationship_analysis_v3.png
- Discuss: Why did simple approaches fail? What models work better?
Welcome to the wonderful world of non-linear relationships!