Model Evaluation & Metrics
Learn to properly assess your model's performance and avoid common pitfalls like data leakage.
Topic 1: Train-Test Split & Data Leakage
The single most important rule in machine learning is: **Do not evaluate your model on data it has already seen.**
A model that "memorizes" the training data is useless. We want a model that *generalizes* to new, unseen data. To simulate this, we use `train_test_split`.
The Process:
- Split Data: Split your *entire* dataset into `X_train, X_test, y_train, y_test`. (e.g., 80% train, 20% test).
- Train: Fit your model *only* on the `_train` data. The model *never* sees the `_test` data.
- Evaluate: Make predictions on `X_test` and compare them to `y_test`. This result tells you how your model will *really* perform in the wild (see the code sketch after this list).
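Here is the process above as a minimal scikit-learn sketch. The `load_breast_cancer` dataset and `LogisticRegression` model are stand-ins chosen for illustration, not part of the original example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder dataset for illustration; swap in your own X and y.
X, y = load_breast_cancer(return_X_y=True)

# 1. Split: 80% train, 20% test. random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Train: the model only ever sees the training data.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# 3. Evaluate: score on data the model has never seen.
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```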
Data Leakage: The Silent Killer
Data Leakage is when information from your *test* set accidentally "leaks" into your *training* process. This makes your model look perfect during development, but it will fail in production.
Example: Scaling your data *before* you split. The `StandardScaler` would learn the mean/std of the *entire* dataset (including the test data). This is a form of cheating.
The Fix: Always split *first*. Then, `fit_transform` the scaler on `X_train` and *only* `transform` `X_test`.
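A short sketch of that fix, assuming `X_train` and `X_test` come from a split like the one above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training data ONLY: it learns the training mean/std.
X_train_scaled = scaler.fit_transform(X_train)

# Apply (but do not re-fit) that same transformation to the test data.
X_test_scaled = scaler.transform(X_test)
```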
Topic 2: Classification Metrics
For classification (predicting categories), we need to know what kind of mistakes our model is making.

Accuracy: The Misleading Metric
Accuracy is (Correct Predictions / Total Predictions). It's simple, but dangerous.
The Problem: Imagine a dataset where 99% of emails are "Not Spam" and 1% are "Spam." A model that only predicts "Not Spam" every time would be 99% accurate, but it would be a useless model!
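A quick way to see this trap in code, using scikit-learn's `DummyClassifier` and a made-up 990/10 label split:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 990 "Not Spam" (0) and 10 "Spam" (1) emails: a heavily imbalanced dataset.
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # dummy features; this "model" ignores them anyway

# A baseline that always predicts the most frequent class ("Not Spam").
always_not_spam = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = always_not_spam.predict(X)

print(f"Accuracy: {accuracy_score(y, y_pred):.2%}")  # 99.00%, yet it catches zero spam
```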
For this, we use the Confusion Matrix.
Confusion Matrix, Precision, Recall, F1-Score
This table breaks down the types of correct and incorrect predictions:
- True Positive (TP): Actual: 1, Predicted: 1
- True Negative (TN): Actual: 0, Predicted: 0
- False Positive (FP): Actual: 0, Predicted: 1 (Type I Error)
- False Negative (FN): Actual: 1, Predicted: 0 (Type II Error)
The 3 Most Important Metrics:
- Precision: TP / (TP + FP)
  Of all the times the model predicted "Positive," how many were actually positive? (A measure of "trustworthiness.")
- Recall: TP / (TP + FN)
  Of all the actual "Positive" cases, how many did the model find? (A measure of "completeness.")
- F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
  The harmonic mean of Precision and Recall. This is the single best metric for imbalanced classes.
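All three metrics, plus the confusion matrix itself, are available in `sklearn.metrics`. A minimal sketch, assuming `y_test` (actual labels) and `y_pred` (model predictions) already exist:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Assumes y_test (actual labels) and y_pred (predicted labels) are binary arrays.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")

print(f"Precision: {precision_score(y_test, y_pred):.3f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")     # TP / (TP + FN)
print(f"F1-Score:  {f1_score(y_test, y_pred):.3f}")         # harmonic mean of the two
```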
Topic 3: Regression Metrics
For regression (predicting continuous values), we measure the "distance" of the errors.
Mean Squared Error (MSE) & Root Mean Squared Error (RMSE)
Mean Squared Error (MSE):
MSE = (1 / n) × Σ (y_actual − y_predicted)²
This is the quantity many regression models (such as linear regression) minimize during training, but its units are squared (for example, "dollars²"), which makes it harder to interpret.
Root Mean Squared Error (RMSE):
RMSE = √MSE
Interpretation: An RMSE of 50,000 for house prices means your model's predictions are typically off by about $50,000 (note that RMSE weights large errors more heavily than a plain average of errors).
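A brief sketch of computing both with scikit-learn, assuming `y_test` and `y_pred` come from a fitted regression model:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Assumes y_test (actual prices) and y_pred (predicted prices) already exist.
mse = mean_squared_error(y_test, y_pred)   # units are squared (e.g. dollars²)
rmse = np.sqrt(mse)                        # back in the original units (dollars)

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```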
R-Squared (R²)
R-Squared (also called the Coefficient of Determination) is one of the most popular metrics for explaining a regression model's performance to non-technical audiences.
Interpretation: R² is a proportion (usually between 0.0 and 1.0, though it can go negative) that tells you how much of the variance in the target variable is explained by your model.
- R² = 0.85: "Our model explains 85% of the variance in house prices."
- R² = 0.0: "Our model is no better than just predicting the average price for every house."
- R² < 0.0: "Our model is actively worse than just predicting the average."
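In scikit-learn, this is `r2_score` (it is also what regressors return from their `.score()` method). A minimal sketch, again assuming `y_test` and `y_pred` exist:

```python
from sklearn.metrics import r2_score

# Assumes y_test (actual values) and y_pred (model predictions) already exist.
r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.2f}")  # e.g. 0.85 -> "the model explains 85% of the variance"
```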
Topic 4: Cross-Validation
A single train-test split is good, but what if you got "lucky" (or unlucky) with your 20% test set? The data in that split might be unusually easy or hard, giving you a misleading score.
Cross-Validation (CV) is a more robust technique that solves this. The most common form is **k-Fold Cross-Validation**.
The k-Fold CV Process (e.g., k=5):
- Shuffle & Split: Randomly shuffle your *entire* dataset and split it into 5 equal "folds."
- Fold 1: Train the model on Folds 2, 3, 4, 5. Test it on Fold 1. Record the score.
- Fold 2: Train the model on Folds 1, 3, 4, 5. Test it on Fold 2. Record the score.
- ...and so on...
- Fold 5: Train the model on Folds 1, 2, 3, 4. Test it on Fold 5. Record the score.
You now have 5 different scores (e.g., `[0.92, 0.88, 0.95, 0.91, 0.90]`). The mean of these scores (0.912) is a much more stable and reliable estimate of your model's true performance.
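For intuition, here is that loop written out by hand with `KFold`, assuming `X_scaled` and `y` are NumPy arrays defined earlier and using `LogisticRegression` as a placeholder model; the `cross_val_score` helper below does the same thing in one line:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Assumes X_scaled (features) and y (labels) are NumPy arrays defined earlier;
# LogisticRegression is just a placeholder model for illustration.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, test_idx in kf.split(X_scaled):
    fold_model = LogisticRegression(max_iter=10000)    # fresh model for each fold
    fold_model.fit(X_scaled[train_idx], y[train_idx])  # train on the other 4 folds
    fold_scores.append(fold_model.score(X_scaled[test_idx], y[test_idx]))  # test on this fold

print(f"Fold scores: {np.round(fold_scores, 3)}")
print(f"Mean score:  {np.mean(fold_scores):.4f}")
```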
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumes X_scaled (scaled features) and y (labels) are already defined;
# LogisticRegression is a placeholder for whatever model you are evaluating.
model = LogisticRegression(max_iter=10000)

# Note: we pass the *entire* X_scaled and y, as cross_val_score
# handles the splitting internally. cv=5 means 5-fold cross-validation.
scores = cross_val_score(model, X_scaled, y, cv=5)
print(f"Mean Accuracy: {scores.mean():.4f}")
print(f"Std Deviation: {scores.std():.4f}")