
Linear Regression

Learn to predict continuous values by fitting the 'line of best fit' to your data.

Machine Learning · Fundamental · 60 min

Topic 1: What is Linear Regression?

Linear Regression is a supervised learning algorithm used to predict a continuous value (like a price, temperature, or score). It's one of the simplest and most interpretable models in machine learning.



It works by finding a linear relationship between one or more input features (X) and a single output target (Y). In its simplest form with one feature, this is the classic "line of best fit" from algebra.

The Equation of a Line:

For a single feature (Simple Linear Regression), the formula is:

Y = β₀ + β₁X₁

  • Y is the value we want to predict (the target).
  • X₁ is our input feature.
  • β₀ is the intercept: the value of Y when X₁ = 0.
  • β₁ is the coefficient (or slope): how much Y changes for a one-unit change in X₁.
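
For example, suppose a (hypothetical) house-price model learns β₀ = 50,000 and β₁ = 200, where X₁ is square footage. A 1,000 sq ft house would then be predicted at Y = 50,000 + 200 × 1,000 = 250,000.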

Multiple Linear Regression

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ
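
As a quick sketch (with made-up, noise-free data), scikit-learn's LinearRegression exposes the fitted intercept and coefficients directly, one coefficient per feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 3 + 2*x1 - 1*x2 (no noise),
# so the model should recover these values almost exactly
rng = np.random.default_rng(0)
X = rng.random((100, 2))           # 100 rows, 2 features (X1, X2)
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.intercept_)  # β₀ ≈ 3.0
print(model.coef_)       # [β₁, β₂] ≈ [2.0, -1.0]
```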


Topic 2: How Does it "Learn"? (The Cost Function)

How does the model "know" what the "best" coefficients are? It tries to find the line that minimizes the total error.

The "error" for a single data point is called the residual. It's the vertical distance between the actual data point (yi) and the model's predicted line (yi).


[Figure: Linear regression residuals, the vertical distances between each data point and the fitted line]

Cost Function: Mean Squared Error (MSE)

The model's goal is to minimize a "cost function." For linear regression, this is almost always the Mean Squared Error (MSE). The model:

  1. Calculates the residual (error) for every single point in the data.
  2. Squares each error (to make them all positive and to heavily penalize large errors).
  3. Calculates the Mean (average) of all these squared errors.

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

The "learning" process, called Ordinary Least Squares (OLS), is a mathematical method that finds the exact coefficient values (β0, β1, …) that result in the lowest possible MSE.


Topic 3: Evaluating Your Model (Metrics)

After you've trained your model, how do you know if it's any good? There are two key metrics you must know.

1. Root Mean Squared Error (RMSE)

The MSE is great for training, but its units are squared (e.g., "squared dollars"), which is hard to interpret. To fix this, we just take its square root to get the RMSE.

RMSE = √(MSE)

Interpretation: The RMSE is the average "distance" (error) of your model's predictions from the actual values, in the original units of your target.

An RMSE of 50,000 for house prices means your model's predictions are, on average, "off" by about 50,000 in the same units as the prices themselves.
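
A minimal sketch, reusing the made-up arrays from the MSE example: take the square root of the MSE yourself. (Newer scikit-learn versions also ship a root_mean_squared_error helper, but np.sqrt works everywhere.)

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.5, 5.5, 7.0, 10.0])

rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
print(rmse)  # ≈ 0.612, in the original units of y
```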

2. R-Squared (R²) — The "Coefficient of Determination"

This is the most popular metric for regression. R² tells you what percentage of the variance in the target variable (Y) is explained by your input features (X).

Interpretation:

  • R² = 1.0: A perfect model; your features explain 100% of the variance in the target.
  • R² = 0.65: “Our model explains 65% of the variance in house prices using the features we provided.”
  • R² = 0.0: Your model is no better than simply predicting the average Y value for every point.
  • R² < 0.0: Your model is actively worse than just guessing the average.

Topic 4: Assumptions and Feature Scaling

Linear Regression is powerful, but it relies on a few key assumptions. The most important one is that **the relationship between your features and your target is linear** (i.e., a straight line, not a curve).
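
As a quick illustration with synthetic data, fitting a straight line to a clearly curved (quadratic) relationship leaves essentially all of the variance unexplained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data where the true relationship is quadratic, not linear
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R² ≈ 0: a straight line can't capture the curve
```

A common fix is to transform the features first (for example, adding an X² column with PolynomialFeatures) so the relationship becomes linear in the transformed features.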

Why Feature Scaling Matters

While a plain linear regression will work *without* scaling, scaling is a critical best practice for two reasons:

  1. Interpretation: When you use StandardScaler (to give all features a mean of 0 and std dev of 1), you can directly compare the coefficients. The feature with the largest *absolute* coefficient (e.g., -0.8) is the most "important" in the model's decision-making, as all features are on the same scale.
  2. Compatibility: Many other models (like SVMs, Neural Networks, PCA) or advanced regression types (Ridge, Lasso) *require* scaling to work correctly. It's a good habit to always scale your data.

A typical workflow looks like this (with synthetic data standing in for X and y):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 200 samples, 3 features
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, 200)

# 1. Split your data FIRST (to prevent data leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Create the scaler
scaler = StandardScaler()

# 3. Fit the scaler ONLY on X_train (it learns the training mean and std dev)
X_train_scaled = scaler.fit_transform(X_train)

# 4. Use that same fitted scaler to transform X_test
X_test_scaled = scaler.transform(X_test)

# 5. Now, train your model on X_train_scaled
model = LinearRegression().fit(X_train_scaled, y_train)
print(model.coef_)                         # comparable, since features share a scale
print(model.score(X_test_scaled, y_test))  # R² on the held-out test set
```