Employee Attrition Prediction

// Introduction to Machine Learning — HAMK · IBM HR Analytics Dataset · KNN vs Gaussian Naive Bayes

Total employees: 1,470
Left (Attrition = Yes): 237
Class 0 (Stayed): 83.9%

Attrition Distribution — Class Imbalance

⚠ Class imbalance: only 16.1% of employees left. Accuracy alone misleads: a model that predicts "stayed" for everyone already scores 83.9%. Recall for the minority class (Attrition = Yes) is the critical metric.
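To make the warning concrete, the short sketch below computes what a trivial "everyone stays" predictor would score, using only the two counts quoted above. The variable names are illustrative and not part of the original pipeline.

```python
# Counts taken from the summary figures above.
total_employees = 1470
left = 237
stayed = total_employees - left

# A trivial model that predicts "stayed" (class 0) for every employee
# is right for all stayers and wrong for all leavers.
baseline_accuracy = stayed / total_employees
baseline_recall_yes = 0 / left  # it never flags a single leaver

print(f"Baseline accuracy:               {baseline_accuracy:.1%}")
print(f"Baseline recall (Attrition=Yes): {baseline_recall_yes:.1%}")
```

Any model worth deploying has to beat this baseline on accuracy and, more importantly, achieve a recall well above zero for the leavers.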

Feature Categories (35 features after OHE)

Pipeline Steps

Accuracy: 85.0%
Recall (Attrition=Yes): 32%
Best K value: K = 7

Confusion Matrix — KNN (k=7)

              Predicted No    Predicted Yes
Actual No     TN = 233        FP = 8
Actual Yes    FN = 36         TP = 17

Precision (Yes): 68%
Recall (Yes): 32%
F1 (Yes): 44%
Precision (No): 87%
Recall (No): 97%
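The figures above can be re-derived directly from the four confusion-matrix cells. The sketch below does the arithmetic by hand rather than calling scikit-learn, so each formula is visible; the variable names are illustrative.

```python
# Confusion-matrix cells for KNN (k=7), as reported above.
tn, fp, fn, tp = 233, 8, 36, 17

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Positive class: Attrition = Yes
precision_yes = tp / (tp + fp)   # of those flagged as leavers, how many left
recall_yes = tp / (tp + fn)      # of those who left, how many were flagged
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Negative class: Attrition = No
precision_no = tn / (tn + fn)
recall_no = tn / (tn + fp)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision (Yes)", precision_yes),
    ("Recall (Yes)", recall_yes),
    ("F1 (Yes)", f1_yes),
    ("Precision (No)", precision_no),
    ("Recall (No)", recall_no),
]:
    print(f"{name:<16} {value:.3f}")
```

The test set holds 294 employees (20% of 1,470), of whom 53 left. The model catches 17 of them and misses 36, which is why recall stays low despite the high accuracy.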

Classification Report — KNN

KNN Accuracy vs K value (k = 1 to 20)

Optimal K = 7 selected: highest test accuracy (85.0%) before the curve plateaus. K=1 overfits (high variance); K≥15 underfits (high bias).

Bias–Variance Trade-off

Low K (K=1–3): High variance, overfits training data. Memorises noise.

Mid K (K=5–9): Best balance. Generalises well to test set.

High K (K≥15): High bias, underfits. Accuracy drops as too many neighbours dilute the local signal and predictions drift towards the majority class.
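The trade-off can be seen by comparing training and test accuracy for a low, mid, and high K. The sketch below uses a synthetic dataset from `make_classification` (an assumption made so the example runs on its own, not the HR data), so the numbers will differ from those reported here; only the pattern is of interest.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 10% label noise (not the IBM HR dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for k in (1, 7, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each K.
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(f"k={k:>2}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbour, so training accuracy is perfect while test accuracy is not. That gap is the signature of high variance.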

Feature Scaling — Why StandardScaler?

KNN is a distance-based algorithm. Without scaling, features with large ranges (e.g. Monthly Income: 1,000–20,000) dominate distance calculations over features with small ranges (e.g. Age: 18–60).

StandardScaler rescales each feature to mean μ=0 and standard deviation σ=1, so every feature contributes on a comparable scale to the Euclidean distance.
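A small sketch of the effect, under stated assumptions: the data are drawn uniformly from the Age and Monthly Income ranges quoted above, and the three employees (`query`, `p1`, `p2`) are hypothetical. It checks which of the two candidates is closer to the query before and after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: columns are [Age, MonthlyIncome], drawn from the quoted ranges.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(18, 60, size=500),
    rng.uniform(1_000, 20_000, size=500),
])

query = np.array([[30.0, 5000.0]])
p1 = np.array([[31.0, 5800.0]])   # similar age, income differs by 800
p2 = np.array([[58.0, 5100.0]])   # 28 years older, income differs by 100

def euclid(a, b):
    """Euclidean distance between two 1-row arrays."""
    return float(np.linalg.norm(a - b))

raw_p1, raw_p2 = euclid(query, p1), euclid(query, p2)

# Fit the scaler on the data, then transform the three points with it.
scaler = StandardScaler().fit(X)
q_s, p1_s, p2_s = (scaler.transform(v) for v in (query, p1, p2))
scaled_p1, scaled_p2 = euclid(q_s, p1_s), euclid(q_s, p2_s)

print(f"raw:    d(query, p1)={raw_p1:.1f}  d(query, p2)={raw_p2:.1f}")
print(f"scaled: d(query, p1)={scaled_p1:.3f}  d(query, p2)={scaled_p2:.3f}")
```

On the raw values the income gap decides everything, so the much older employee `p2` looks nearer. After scaling, the age gap counts as well and `p1` becomes the nearer neighbour.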

Full ML Pipeline — Python + Scikit-learn

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# ── 1. Load & clean ──────────────────────────────────────────
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'],
      axis=1, inplace=True)          # remove constant columns and the ID column

# ── 2. Encode target + OHE categorical features ──────────────
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, drop_first=True)   # avoids dummy variable trap

# ── 3. Train / test split (80 / 20) ─────────────────────────
X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ── 4. Scale features (essential for KNN) ───────────────────
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# ── 5. KNN — optimise K ──────────────────────────────────────
# Note: K is chosen by accuracy on the test set, so the reported
# test accuracy is an optimistic estimate.
accuracies = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    accuracies.append(accuracy_score(y_test, knn.predict(X_test_s)))

best_k = accuracies.index(max(accuracies)) + 1  # → k=7

# ── 6. Final KNN model ───────────────────────────────────────
knn_best = KNeighborsClassifier(n_neighbors=best_k)   # use the K found in step 5
knn_best.fit(X_train_s, y_train)
y_pred_knn = knn_best.predict(X_test_s)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred_knn):.3f}")

# ── 7. Gaussian Naive Bayes ──────────────────────────────────
nb = GaussianNB()
nb.fit(X_train_s, y_train)
y_pred_nb = nb.predict(X_test_s)
print(f"NB Accuracy:  {accuracy_score(y_test, y_pred_nb):.3f}")

# ── 8. Compare ───────────────────────────────────────────────
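labels = ['Stayed', 'Left']          # class 0, class 1
print("KNN\n", classification_report(y_test, y_pred_knn, target_names=labels))
print("Gaussian NB\n", classification_report(y_test, y_pred_nb, target_names=labels))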
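Since recall for the minority class is the metric that matters here, one possible variation is to choose K by cross-validated recall on the training set only, leaving the test set for the final check. The sketch below is an assumption-laden alternative, not the pipeline used above: it runs on synthetic imbalanced data from `make_classification` so that it is self-contained, and it adds a stratified split and `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with roughly 16% positives (not the IBM HR dataset).
X, y = make_classification(
    n_samples=1470, n_features=20, weights=[0.84, 0.16], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling sits inside the pipeline so each CV fold fits its own scaler.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 21))},
    scoring="recall",   # recall of the positive class (label 1)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_recall = recall_score(y_test, search.predict(X_test))
print(f"Best K by CV recall: {best_k}")
print(f"Test recall (positive class): {test_recall:.3f}")
```

To apply this to the HR data, the `X_train`, `y_train`, `X_test`, and `y_test` from step 3 would replace the synthetic arrays. The K selected this way may differ from the accuracy-based choice.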