
Logistic Regression

Learn to predict probabilities and classes, and evaluate your model with the Confusion Matrix.

Machine Learning Fundamentals · 60 min

Topic 1: What is Logistic Regression?

Despite its name, **Logistic Regression** is a **supervised learning** algorithm used for **classification**, not regression. It's one of the most fundamental and widely-used classification models.


It's used to answer "yes/no" questions, such as "Will this customer churn?" (Yes/No) or "Is this email spam?" (Spam/Not Spam).

The Core Idea: Predicting Probability

The "trick" of logistic regression is that it doesn't just output a "Yes" or "No." It outputs the **probability** (a value between 0.0 and 1.0) that the answer is "Yes."

For example, it won't just say "Spam." It will say: "I am 85% (0.85) sure this is Spam."

It does this by taking the output of a standard *Linear* Regression ($Y = \beta_0 + \beta_1 X_1 + \dots$) and feeding that output into a special function called the **Sigmoid function**.
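
Here is a minimal sketch of this idea in scikit-learn. The tiny "spam" dataset and its two features are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two made-up features per email: (number of links, number of exclamation marks)
X = np.array([[0, 0], [1, 0], [2, 1], [5, 3], [7, 4], [9, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = not spam, 1 = spam

model = LogisticRegression()
model.fit(X, y)

new_email = np.array([[6, 2]])
print(model.predict_proba(new_email))  # e.g. [[0.1 0.9]]: "90% sure this is Spam"
print(model.predict(new_email))        # [1], i.e. "Spam"
```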


Topic 2: The Sigmoid Function

The Sigmoid (or "Logistic") function is the heart of this algorithm. It's a simple mathematical function that squashes any number (from negative infinity to positive infinity) into a value between 0 and 1.

This is perfect for probabilities!

Here is the process:

  1. The model calculates a "score" (z) using a linear regression formula: $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
  2. If z is a large positive number, the model is very confident of a "Yes."
  3. If z is a large negative number, the model is very confident of a "No."
  4. This score z is passed into the sigmoid function: $P(Y = 1) = \frac{1}{1 + e^{-z}}$
  5. The output $P(Y = 1)$ is the probability of the class being "1" (e.g., "Spam"), as the sketch below demonstrates.
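
To see the squashing in action, here is a small sketch of the sigmoid in plain NumPy (the sample z values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """Map a raw linear score z to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-z))

for z in [-10, -2, 0, 2, 10]:
    print(f"z = {z:>3} -> P(Y=1) = {sigmoid(z):.4f}")
# z = -10 -> P(Y=1) = 0.0000  (very confident "No")
# z =   0 -> P(Y=1) = 0.5000  (completely unsure)
# z =  10 -> P(Y=1) = 1.0000  (very confident "Yes")
```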

Topic 3: The Decision Boundary (Threshold)

The model gives you a probability, like 0.85. But how do you make the final "Spam" or "Not Spam" decision?

You use a **Decision Boundary** (or **Threshold**). By default, this threshold is 0.5.

Default Decision Rule (Threshold = 0.5):

  • If `model.predict_proba()` gives class 1 a probability ≥ 0.5, predict Class 1 ("Yes").
  • If `model.predict_proba()` gives class 1 a probability < 0.5, predict Class 0 ("No").

This threshold is a hyperparameter you can tune. For example, in a medical test for a dangerous disease, you might lower the threshold to 0.3. This would mean you classify more people as "at risk" (getting more False Positives), but you are less likely to miss an actual case (fewer False Negatives).
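
Here is a sketch of how you might apply a custom threshold yourself. It assumes `model` is an already-fitted classifier with a `predict_proba` method (like `LogisticRegression`) and `X_test` is your feature matrix; both names are placeholders:

```python
def predict_with_threshold(model, X, threshold=0.5):
    """Predict class 1 whenever P(class 1) meets or exceeds the threshold."""
    proba_class_1 = model.predict_proba(X)[:, 1]  # column 1 holds P(Y = 1)
    return (proba_class_1 >= threshold).astype(int)

# With threshold=0.5 this matches model.predict(X).
# Lowering it catches more true cases at the cost of more false positives:
# y_pred = predict_with_threshold(model, X_test, threshold=0.3)
```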


Topic 4: Evaluating a Classification Model

We can't use R² or RMSE for classification. Instead, we use a different set of metrics derived from the Confusion Matrix.

The Confusion Matrix

This is a table that shows your model's performance by comparing its predictions to the actual labels; a code sketch for building one follows the list below.

  • True Positive (TP): Actual: Yes, Predicted: Yes. (Correct!)
  • True Negative (TN): Actual: No, Predicted: No. (Correct!)
  • False Positive (FP): Actual: No, Predicted: Yes. (Wrong - Type I Error)
  • False Negative (FN): Actual: Yes, Predicted: No. (Wrong - Type II Error)
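
Here is a sketch of computing these four counts with scikit-learn's `confusion_matrix`. The `y_true` and `y_pred` arrays are made-up labels for illustration; note that for labels `[0, 1]`, scikit-learn arranges the matrix as `[[TN, FP], [FN, TP]]`:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels (made up)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model's predictions (made up)

# For labels [0, 1] the layout is:
#   [[TN, FP],
#    [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```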

Key Metrics from the Matrix:

  • Accuracy: (TP + TN) / Total — The percentage of all predictions that were correct. Simple, but misleading if your classes are imbalanced (e.g., 99% "No", 1% "Yes").
  • Precision: TP / (TP + FP) — Of all the times the model predicted "Yes," what percentage was correct? (A metric of "trustworthiness.")
  • Recall: TP / (TP + FN) — Of all the actual "Yes" cases, what percentage did the model find? (A metric of "completeness.")

`sklearn.metrics.classification_report` gives you all of these metrics at once!
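
As a sketch, here is how the metrics above come together in scikit-learn, reusing the made-up labels from the confusion-matrix example:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, classification_report)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # (TP + TN) / Total = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP)    = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN)    = 3/4 = 0.75
print(classification_report(y_true, y_pred))  # all of the above, per class
```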