Employee Attrition Prediction
// Introduction to Machine Learning — HAMK · IBM HR Analytics Dataset · KNN vs Gaussian Naive Bayes
Total employees: 1,470
Left (Attrition = Yes): 237
Class 0 (Stayed): 83.9%
Attrition Distribution — Class Imbalance
⚠ Class imbalance: only 16.1% of employees left. Accuracy alone misleads; recall for the minority class is the critical metric.
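The warning can be checked with simple arithmetic. The sketch below uses only the counts shown above (1,470 employees, 237 leavers) and a hypothetical baseline that predicts "Stayed" for everyone; the variable names are illustrative.

```python
# Counts taken from the dataset summary above.
total_employees = 1470
left = 237
stayed = total_employees - left

# Hypothetical baseline: always predict class 0 (Stayed).
baseline_accuracy = stayed / total_employees  # every stayer counts as correct
baseline_recall_yes = 0 / left                # no leaver is ever flagged

print(f"Baseline accuracy:     {baseline_accuracy:.1%}")
print(f"Baseline recall (Yes): {baseline_recall_yes:.1%}")
```

Any model has to be judged against this do-nothing baseline, which is why an accuracy figure says little here unless minority-class recall is reported next to it.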
Feature Categories (35 features after OHE)
Pipeline Steps
Accuracy: 85.0%
Recall (Attrition = Yes): 32%
Best K value: 7
Confusion Matrix — KNN (k=7)
                Predicted: No    Predicted: Yes
Actual: No      TN = 233         FP = 8
Actual: Yes     FN = 36          TP = 17
Precision (Yes): 68%
Recall (Yes): 32%
F1 (Yes): 44%
Precision (No): 87%
Recall (No): 97%
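These figures follow directly from the four confusion-matrix counts. A small check, assuming the counts reported for KNN (TN = 233, FP = 8, FN = 36, TP = 17); the variable names are illustrative:

```python
# Confusion-matrix counts for KNN (k=7) as reported above.
tn, fp, fn, tp = 233, 8, 36, 17

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Positive class: Attrition = Yes
precision_yes = tp / (tp + fp)
recall_yes = tp / (tp + fn)
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Negative class: Attrition = No
precision_no = tn / (tn + fn)
recall_no = tn / (tn + fp)

print(f"Test samples:    {total}")
print(f"Accuracy:        {accuracy:.3f}")
print(f"Precision (Yes): {precision_yes:.3f}")
print(f"Recall (Yes):    {recall_yes:.3f}")
print(f"F1 (Yes):        {f1_yes:.3f}")
print(f"Precision (No):  {precision_no:.3f}")
print(f"Recall (No):     {recall_no:.3f}")
```

The 36 false negatives are the employees who left without being flagged. They hold recall for the "Yes" class at roughly a third even though overall accuracy is about 85%.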
Classification Report — KNN
KNN Accuracy vs K value (k = 1 to 20)
Optimal K = 7 selected: highest test accuracy (85.0%) before the curve plateaus. K=1 overfits (high variance); K≥15 underfits (high bias).
Bias–Variance Trade-off
Low K (K=1–3): High variance, overfits training data. Memorises noise.
Mid K (K=5–9): Best balance. Generalises well to test set.
High K (K≥15): High bias, underfits. Accuracy drops as too many neighbours dilute signal.
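The "memorises noise" behaviour at low K can be seen by comparing training and test accuracy. This is a minimal sketch on synthetic data from `make_classification` (an assumption standing in for the HR dataset, with roughly 16% minority class); the three K values are illustrative and the exact scores will differ from the project's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the HR data (assumption): about 16% minority class.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.84, 0.16], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Record (train accuracy, test accuracy) for a low, mid and high K.
scores = {}
for k in (1, 7, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"k={k:>2}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With K=1 every training point is its own nearest neighbour, so training accuracy is perfect by construction. The gap between that figure and the test accuracy is the variance the list above describes.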
Feature Scaling — Why StandardScaler?
KNN is a distance-based algorithm. Without scaling, features with large ranges (e.g. Monthly Income: 1,000–20,000) dominate distance calculations over features with small ranges (e.g. Age: 18–60).
StandardScaler transforms each feature to μ=0, σ=1, giving all features equal weight in the Euclidean distance.
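A small numeric illustration of the scaling argument, using three made-up employees (the values are assumptions, not rows from the dataset). The first two differ by 30 years in age but only 100 in monthly income. Standardisation is done by hand with NumPy, using the population standard deviation as `StandardScaler` does.

```python
import numpy as np

# Three made-up employees: columns are [Age, MonthlyIncome].
X = np.array([
    [25.0, 3000.0],
    [55.0, 3100.0],
    [26.0, 9000.0],
])

def income_share(a, b):
    """Fraction of the squared Euclidean distance contributed by income."""
    sq_diff = (a - b) ** 2
    return sq_diff[1] / sq_diff.sum()

# Standardise each column to mean 0 and standard deviation 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_share = income_share(X[0], X[1])
scaled_share = income_share(X_scaled[0], X_scaled[1])

print(f"Income share of squared distance, raw:    {raw_share:.3f}")
print(f"Income share of squared distance, scaled: {scaled_share:.4f}")
```

On the raw values the small income difference accounts for most of the distance between the two employees; after standardisation the 30-year age gap dominates, as one would expect.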
Full ML Pipeline — Python + Scikit-learn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# ── 1. Load & clean ──────────────────────────────────────────
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.drop(['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'],
        axis=1, inplace=True)  # remove constant columns and the ID column
# ── 2. Encode target + OHE categorical features ──────────────
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, drop_first=True) # avoids dummy variable trap
# ── 3. Train / test split (80 / 20) ─────────────────────────
X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# ── 4. Scale features (essential for KNN) ───────────────────
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# ── 5. KNN — optimise K ──────────────────────────────────────
accuracies = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    accuracies.append(accuracy_score(y_test, knn.predict(X_test_s)))
best_k = accuracies.index(max(accuracies)) + 1 # → k=7
# ── 6. Final KNN model ───────────────────────────────────────
knn_best = KNeighborsClassifier(n_neighbors=7)
knn_best.fit(X_train_s, y_train)
y_pred_knn = knn_best.predict(X_test_s)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred_knn):.3f}")
# ── 7. Gaussian Naive Bayes ──────────────────────────────────
nb = GaussianNB()
nb.fit(X_train_s, y_train)
y_pred_nb = nb.predict(X_test_s)
print(f"NB Accuracy: {accuracy_score(y_test, y_pred_nb):.3f}")
# ── 8. Compare ───────────────────────────────────────────────
print(classification_report(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_nb))
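Step 5 above picks K by scoring each candidate on the test set, so the reported test accuracy may be somewhat optimistic. One common alternative, sketched here on synthetic data and not the method used above, is to choose K by cross-validation on the training split only and to score on minority-class recall. The dataset from `make_classification`, the 5-fold setting and the `stratify=y` split are all assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HR data (assumption): about 16% minority class.
X, y = make_classification(n_samples=1470, n_features=20,
                           weights=[0.84, 0.16], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 21))}

# Pick K by 5-fold cross-validated recall for the positive class.
grid = GridSearchCV(pipe, param_grid, scoring="recall", cv=5)
grid.fit(X_train, y_train)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best K by CV recall: {best_k}")
print(f"Mean CV recall:      {grid.best_score_:.3f}")
print(f"Held-out accuracy:   {grid.score(X_test, y_test):.3f}")  # note: grid.score uses recall
```

Because `scoring="recall"` is set, `grid.score` on the held-out split also returns recall rather than accuracy, despite the label in the last print statement; use `accuracy_score` on `grid.predict(X_test)` if accuracy is wanted. The test split is touched only once, after K has been fixed.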