
Model Evaluation & Metrics

Learn to properly assess your model's performance and avoid common pitfalls like data leakage.

Machine Learning · Fundamental · 60 min

Topic 1: Train-Test Split & Data Leakage

The single most important rule in machine learning is: **Do not evaluate your model on data it has already seen.**

A model that "memorizes" the training data is useless. We want a model that *generalizes* to new, unseen data. To simulate this, we use train_test_split.

The Process (a code sketch follows the steps):

  1. Split Data: Split your *entire* dataset into `X_train, X_test, y_train, y_test`. (e.g., 80% train, 20% test).
  2. Train: Fit your model *only* on the `_train` data. The model *never* sees the `_test` data.
  3. Evaluate: Make predictions on `X_test` and compare them to `y_test`. This result tells you how your model will *really* perform in the wild.
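
A minimal sketch of this workflow, assuming a feature matrix `X` and target vector `y` are already loaded; the model choice is just an illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Split: hold out 20% of the data; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Train: the model only ever sees the training portion.
model = LogisticRegression()
model.fit(X_train, y_train)

# 3. Evaluate: score on the held-out test set.
print(f"Test accuracy: {model.score(X_test, y_test):.4f}")
```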

Data Leakage: The Silent Killer

Data Leakage is when information from your *test* set accidentally "leaks" into your *training* process. This makes your model look perfect during development, but it will fail in production.

Example: Scaling your data *before* you split. The `StandardScaler` would learn the mean/std of the *entire* dataset (including the test data). This is a form of cheating.
The Fix: Always split *first*. Then, `fit_transform` the scaler on `X_train` and *only* `transform` `X_test`.
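
A sketch of the leak-free ordering (again assuming `X` and `y` already exist):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split FIRST, so the scaler never sees the test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test data
```

scikit-learn's `Pipeline` can bundle the scaler and model into one object so this ordering is enforced automatically, which is especially useful once you move to cross-validation.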


Topic 2: Classification Metrics

For classification (predicting categories), we need to know what kind of mistakes our model is making.



Accuracy: The Misleading Metric

Accuracy is (Correct Predictions / Total Predictions). It's simple, but dangerous.
The Problem: Imagine a dataset where 99% of emails are "Not Spam" and 1% are "Spam." A model that only predicts "Not Spam" every time would be 99% accurate, but it would be a useless model!
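
To make this concrete, here is a toy version of the spam example (the 99%/1% split is simulated with NumPy purely for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 1,000 emails: 990 "Not Spam" (0) and 10 "Spam" (1).
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts "Not Spam".
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great, yet it catches zero spam
```
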
For this, we use the Confusion Matrix.

Confusion Matrix, Precision, Recall, F1-Score

This table breaks down the types of correct and incorrect predictions:

  • True Positive (TP): Actual: 1, Predicted: 1
  • True Negative (TN): Actual: 0, Predicted: 0
  • False Positive (FP): Actual: 0, Predicted: 1 (Type I Error)
  • False Negative (FN): Actual: 1, Predicted: 0 (Type II Error)

The 3 Most Important Metrics (computed in the sketch below):

  • Precision: TP / (TP + FP)
    Of all the times the model predicted "Positive," how many were actually positive? (A measure of "trustworthiness.")
  • Recall: TP / (TP + FN)
    Of all the actual "Positive" cases, how many did the model find? (A measure of "completeness.")
  • F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
    The harmonic mean of Precision and Recall. This is the single best metric for imbalanced classes.
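
scikit-learn computes all of these from the true labels and your model's predictions; a minimal sketch, assuming `y_test` and `y_pred` arrays already exist:

```python
from sklearn.metrics import (
    confusion_matrix,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
)

# y_test holds the true labels, y_pred the model's predictions on the test set.
print(confusion_matrix(y_test, y_pred))       # [[TN, FP], [FN, TP]] for binary labels
print(precision_score(y_test, y_pred))        # TP / (TP + FP)
print(recall_score(y_test, y_pred))           # TP / (TP + FN)
print(f1_score(y_test, y_pred))               # harmonic mean of precision and recall
print(classification_report(y_test, y_pred))  # all of the above, broken down per class
```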

Topic 3: Regression Metrics

For regression (predicting continuous values), we measure the "distance" of the errors.

Mean Squared Error (MSE) & Root Mean Squared Error (RMSE)

Mean Squared Error (MSE):

MSE = (1 / n) × Σ (y_actual − y_predicted)²

This is what the model minimizes during training, but its units are squared (for example, "dollars²"), which makes it harder to interpret.

Root Mean Squared Error (RMSE):

RMSE = √MSE

Interpretation: An RMSE of 50,000 for house prices means your model's predictions are typically off by about $50,000 (and because errors are squared before averaging, large mistakes are penalized more heavily than small ones).
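
With scikit-learn, both come from `mean_squared_error`; a sketch using hypothetical `y_test` and `y_pred` arrays from a regression model:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)  # units are squared (e.g. "dollars²")
rmse = np.sqrt(mse)                       # back in the original units (dollars)

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
```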

R-Squared (R²)

R-Squared (also called the Coefficient of Determination) is one of the most popular metrics for explaining a regression model's performance to non-technical audiences.

Interpretation: R² is a proportion (usually between 0.0 and 1.0) that tells you how much of the variance in the target variable is explained by your model.

  • R² = 0.85: "Our model explains 85% of the variance in house prices."
  • R² = 0.0: "Our model is no better than just predicting the average price for every house."
  • R² < 0.0: "Our model is actively worse than just predicting the average."
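
A minimal sketch with `r2_score`, using the same hypothetical `y_test` and `y_pred`:

```python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.2f}")  # e.g. 0.85 -> "explains 85% of the variance"
```

For scikit-learn regressors, calling `model.score(X_test, y_test)` returns this same R² value by default.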

Topic 4: Cross-Validation

A single train-test split is good, but what if you got "lucky" (or unlucky) with your 20% test set? The data in that split might be unusually easy or hard, giving you a misleading score.

Cross-Validation (CV) is a more robust technique that solves this. The most common form is **k-Fold Cross-Validation**.

The k-Fold CV Process (e.g., k=5):

  1. Shuffle & Split: Randomly shuffle your *entire* dataset and split it into 5 equal "folds."
  2. Fold 1: Train the model on Folds 2, 3, 4, 5. Test it on Fold 1. Record the score.
  3. Fold 2: Train the model on Folds 1, 3, 4, 5. Test it on Fold 2. Record the score.
  4. ...and so on...
  5. Fold 5: Train the model on Folds 1, 2, 3, 4. Test it on Fold 5. Record the score.

You now have 5 different scores (e.g., `[0.92, 0.88, 0.95, 0.91, 0.90]`). The mean of these scores (0.912) is a much more stable and reliable estimate of your model's true performance.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Any estimator works here; LogisticRegression is just an example.
model = LogisticRegression(max_iter=1000)

# Note: We use the *entire* X_scaled and y, as cross_val_score
# handles the splitting internally.
# cv=5 means 5-fold cross-validation.
scores = cross_val_score(model, X_scaled, y, cv=5)

print(f"Mean Accuracy: {scores.mean():.4f}")
print(f"Std Deviation: {scores.std():.4f}")
```