Python · Scikit-learn · KNN · Naive Bayes · IBM HR Dataset

Employee Attrition Prediction

// Introduction to Machine Learning — HAMK · IBM HR Analytics Dataset · KNN vs Gaussian Naive Bayes

Total employees: 1,470
Left (Attrition = Yes): 237
Class 0 (Stayed): 83.9%
Attrition Distribution — Class Imbalance

⚠ Class imbalance: only 16.1% of employees left. Accuracy alone misleads; recall for the minority class is the critical metric.
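To make the warning concrete, here is a small sketch that uses only the counts reported above (1,470 employees, 237 leavers). The "always predict Stayed" baseline is a hypothetical model introduced for illustration; it is not part of the original pipeline.

```python
# Counts taken from the summary figures on this page.
total_employees = 1470
left = 237
stayed = total_employees - left

# Hypothetical baseline: predict "Stayed" for every employee.
# It is right for every stayer and wrong for every leaver.
baseline_accuracy = stayed / total_employees
baseline_recall_yes = 0 / left  # zero true positives: no leaver is ever flagged

print(f"Baseline accuracy:     {baseline_accuracy:.1%}")
print(f"Baseline recall (Yes): {baseline_recall_yes:.1%}")
```

A model that never predicts attrition already reaches about 83.9% accuracy, so any reported accuracy is best read against that floor, alongside recall for the Attrition = Yes class.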

Feature Categories (35 features after OHE)
Pipeline Steps
Accuracy: 85.0%
Recall (Attrition = Yes): 32%
Best K value: 7
Confusion Matrix — KNN (k=7)
              Predicted No    Predicted Yes
Actual No     TN = 233        FP = 8
Actual Yes    FN = 36         TP = 17
Precision (Yes): 68%
Recall (Yes): 32%
F1 (Yes): 44%
Precision (No): 87%
Recall (No): 97%
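The per-class figures follow directly from the four confusion-matrix cells. The sketch below recomputes them from the TN, FP, FN and TP counts shown above; the variable names are illustrative.

```python
# Confusion-matrix cells for KNN (k=7), as reported on this page.
tn, fp, fn, tp = 233, 8, 36, 17

accuracy = (tp + tn) / (tn + fp + fn + tp)

# Positive class (Attrition = Yes)
precision_yes = tp / (tp + fp)   # of those flagged as leavers, how many left
recall_yes = tp / (tp + fn)      # of actual leavers, how many were flagged
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

# Negative class (Attrition = No)
precision_no = tn / (tn + fn)
recall_no = tn / (tn + fp)

for name, value in [
    ("Accuracy", accuracy),
    ("Precision (Yes)", precision_yes),
    ("Recall (Yes)", recall_yes),
    ("F1 (Yes)", f1_yes),
    ("Precision (No)", precision_no),
    ("Recall (No)", recall_no),
]:
    print(f"{name:<16} {value:.3f}")
```

Of the 53 actual leavers in the test split, 36 are missed, which is why recall for the Yes class stays low even though overall accuracy looks healthy.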
Classification Report — KNN
KNN Accuracy vs K value (k = 1 to 20)

K = 7 was selected: it gives the highest test accuracy (85.0%), after which the curve plateaus. K=1 overfits (high variance); K≥15 underfits (high bias).
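The pipeline on this page scores each candidate K against the test split. A common alternative, sketched here and not the method used in this project, is to choose K by cross-validation on the training data only and to score on recall, so the test split stays untouched until the end. The data below is a synthetic stand-in generated with `make_classification` (an assumption, roughly 16% positives); the real HR CSV is not loaded.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the HR data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.84, 0.16], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled on its own training part.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
search = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 21))},
    scoring="recall",  # optimise minority-class recall instead of accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
test_recall = search.score(X_test, y_test)  # uses the same recall scorer
print(f"Best K by CV recall: {best_k}, test recall: {test_recall:.2f}")
```

The K found this way on synthetic data says nothing about the HR dataset; the point is only the shape of the procedure.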

Bias–Variance Trade-off
Low K (K=1–3): High variance, overfits training data. Memorises noise.
Mid K (K=5–9): Best balance. Generalises well to test set.
High K (K≥15): High bias, underfits. Accuracy drops as too many neighbours dilute signal.
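The two extremes of the trade-off can be seen with a tiny pure-Python KNN. This is a minimal sketch on hypothetical synthetic data (two overlapping Gaussian blobs), not the HR dataset, and the helper names are invented for illustration.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    by_distance = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], point)),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, eval_X, eval_y, k):
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(eval_X, eval_y)
    )
    return hits / len(eval_y)

def make_data(n, rng):
    """Hypothetical 2-feature data: overlapping blobs, about 30% positives."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        X.append((rng.gauss(label, 1.0), rng.gauss(label, 1.0)))
        y.append(label)
    return X, y

rng = random.Random(0)
train_X, train_y = make_data(61, rng)  # odd size, so a full vote cannot tie
test_X, test_y = make_data(40, rng)

results = {}
for k in (1, 7, len(train_X)):
    results[k] = (
        accuracy(train_X, train_y, train_X, train_y, k),  # training accuracy
        accuracy(train_X, train_y, test_X, test_y, k),    # held-out accuracy
    )
    print(f"k={k:>2}  train={results[k][0]:.2f}  test={results[k][1]:.2f}")
```

With K=1 every training point is its own nearest neighbour, so training accuracy is 1.0 regardless of noise. With K equal to the training-set size, every query receives the same majority-class vote, which is the high-bias limit.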
Feature Scaling — Why StandardScaler?
KNN is a distance-based algorithm. Without scaling, features with large ranges (e.g. Monthly Income: 1,000–20,000) dominate distance calculations over features with small ranges (e.g. Age: 18–60).

StandardScaler transforms each feature to μ=0, σ=1, giving all features equal weight in the Euclidean distance.
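A small worked example of the effect, using four hypothetical (MonthlyIncome, Age) rows invented for illustration. The `standardise` helper mimics what StandardScaler computes (column mean and population standard deviation); it is a sketch, not the library code.

```python
import math
from statistics import mean, pstdev

# Hypothetical (MonthlyIncome, Age) rows, invented for illustration.
employees = [(2000, 25), (2100, 58), (9000, 26), (15000, 40)]

def standardise(rows):
    """Z-score each column using the population std, as StandardScaler does."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        tuple((v - mu) / sigma for v, mu, sigma in zip(row, mus, sigmas))
        for row in rows
    ]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean distance) to row 0."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

scaled = standardise(employees)
nearest_raw = nearest_to_first(employees)
nearest_scaled = nearest_to_first(scaled)

print("Nearest to employee 0 (raw):   ", employees[nearest_raw])
print("Nearest to employee 0 (scaled):", employees[nearest_scaled])
```

On raw values the 25-year-old's nearest neighbour is the 58-year-old, because a 100-unit income gap outweighs a 33-year age gap. After standardisation the nearest neighbour becomes the 26-year-old, since age now contributes on the same scale as income.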
Full ML Pipeline — Python + Scikit-learn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# ── 1. Load & clean ──────────────────────────────────────────
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
# EmployeeCount, Over18 and StandardHours are constant; EmployeeNumber is just an ID
df = df.drop(columns=['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours'])

# ── 2. Encode target + OHE categorical features ──────────────
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
df = pd.get_dummies(df, drop_first=True)  # avoids dummy variable trap

# ── 3. Train / test split (80 / 20) ─────────────────────────
X = df.drop('Attrition', axis=1)
y = df['Attrition']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ── 4. Scale features (essential for KNN) ───────────────────
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data only
X_test_s = scaler.transform(X_test)        # reuse training mean and std

# ── 5. KNN: optimise K ───────────────────────────────────────
# Each K is scored on the test split, matching the accuracy-vs-K chart
accuracies = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_s, y_train)
    accuracies.append(accuracy_score(y_test, knn.predict(X_test_s)))
best_k = accuracies.index(max(accuracies)) + 1  # → k=7

# ── 6. Final KNN model ───────────────────────────────────────
knn_best = KNeighborsClassifier(n_neighbors=best_k)
knn_best.fit(X_train_s, y_train)
y_pred_knn = knn_best.predict(X_test_s)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred_knn):.3f}")

# ── 7. Gaussian Naive Bayes ──────────────────────────────────
nb = GaussianNB()
nb.fit(X_train_s, y_train)
y_pred_nb = nb.predict(X_test_s)
print(f"NB Accuracy: {accuracy_score(y_test, y_pred_nb):.3f}")

# ── 8. Compare ───────────────────────────────────────────────
print(confusion_matrix(y_test, y_pred_knn))  # rows = actual, columns = predicted
print(classification_report(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_nb))