
Linear Regression

Learn to predict continuous values by fitting the 'line of best fit' to your data.

Machine Learning · Fundamental · 60 min

Topic 1: What is Linear Regression?

Linear Regression is a supervised learning algorithm used to predict a continuous value (like a price, temperature, or score). It's one of the simplest and most interpretable models in machine learning.



It works by finding a linear relationship between one or more input features (X) and a single output target (Y). In its simplest form with one feature, this is the classic "line of best fit" from algebra.

The Equation of a Line:

For a single feature (Simple Linear Regression), the formula is:

Y = β₀ + β₁X₁

  • Y is the value we want to predict (the target).
  • X₁ is our input feature.
  • β₀ is the intercept: the value of Y when X₁ = 0.
  • β₁ is the coefficient (or slope): how much Y changes for a one-unit change in X₁.
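
For example, suppose a (hypothetical) house-price model learns β₀ = 50,000 and β₁ = 200, where X₁ is square footage. A 1,000 sq ft house would then be predicted at Y = 50,000 + 200 × 1,000 = 250,000.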

Multiple Linear Regression

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ
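
As a quick sketch (with made-up, noise-free data), scikit-learn's LinearRegression exposes the fitted intercept and coefficients directly, one coefficient per feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 3 + 2*x1 - 1*x2 (no noise),
# so the model should recover these values almost exactly
rng = np.random.default_rng(0)
X = rng.random((100, 2))           # 100 rows, 2 features (X1, X2)
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

model = LinearRegression().fit(X, y)
print(model.intercept_)  # β₀ ≈ 3.0
print(model.coef_)       # [β₁, β₂] ≈ [2.0, -1.0]
```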


Topic 2: How Does it "Learn"? (The Cost Function)

How does the model "know" what the "best" coefficients are? It tries to find the line that minimizes the total error.

The "error" for a single data point is called the residual. It's the vertical distance between the actual data point (yi) and the model's predicted line (yi).


[Figure: Linear regression residuals, the vertical distances between each data point and the fitted line]

Cost Function: Mean Squared Error (MSE)

The model's goal is to minimize a "cost function." For linear regression, this is almost always the Mean Squared Error (MSE). The model:

  1. Calculates the residual (error) for every single point in the data.
  2. Squares each error (to make them all positive and to heavily penalize large errors).
  3. Calculates the Mean (average) of all these squared errors.

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

The "learning" process, called Ordinary Least Squares (OLS), is a mathematical method that finds the exact coefficient values (β0, β1, …) that result in the lowest possible MSE.


Topic 3: Evaluating Your Model (Metrics)

After you've trained your model, how do you know if it's any good? There are two key metrics you must know.

1. Root Mean Squared Error (RMSE)

The MSE is great for training, but its units are squared (e.g., "squared dollars"), which is hard to interpret. To fix this, we just take its square root to get the RMSE.

RMSE = √(MSE)

Interpretation: The RMSE is the average "distance" (error) of your model's predictions from the actual values, in the original units of your target.

An RMSE of 50,000 for house prices means your model's predictions are, on average, "off" by about 50,000 in the same units as the prices themselves.
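
A minimal sketch, reusing the made-up arrays from the MSE example: take the square root of the MSE yourself. (Newer scikit-learn versions also ship a root_mean_squared_error helper, but np.sqrt works everywhere.)

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.5, 5.5, 7.0, 10.0])

rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
print(rmse)  # ≈ 0.612, in the original units of y
```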

2. R-Squared (R²) — The "Coefficient of Determination"

This is the most popular metric for regression. R² tells you what percentage of the variance in the target variable (Y) is explained by your input features (X).

Interpretation:

  • R² = 1.0: A perfect model; your features explain 100% of the variance in the target.
  • R² = 0.65: “Our model explains 65% of the variance in house prices using the features we provided.”
  • R² = 0.0: Your model is no better than simply predicting the average Y value for every point.
  • R² < 0.0: Your model is actively worse than just guessing the average.

Topic 4: Assumptions and Feature Scaling

Linear Regression is powerful, but it relies on a few key assumptions. The most important one is that **the relationship between your features and your target is linear** (i.e., a straight line, not a curve).
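
As a quick illustration with synthetic data, fitting a straight line to a clearly curved (quadratic) relationship leaves essentially all of the variance unexplained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data where the true relationship is quadratic, not linear
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2

model = LinearRegression().fit(X, y)
print(model.score(X, y))  # R² ≈ 0: a straight line can't capture the curve
```

A common fix is to transform the features first (for example, adding an X² column with PolynomialFeatures) so the relationship becomes linear in the transformed features.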

Why Feature Scaling Matters

While a plain linear regression will work *without* scaling, scaling is a critical best practice for two reasons:

  1. Interpretation: When you use StandardScaler (to give all features a mean of 0 and std dev of 1), you can directly compare the coefficients. The feature with the largest *absolute* coefficient (e.g., -0.8) is the most "important" in the model's decision-making, as all features are on the same scale.
  2. Compatibility: Many other models (like SVMs, Neural Networks, PCA) or advanced regression types (Ridge, Lasso) *require* scaling to work correctly. It's a good habit to always scale your data.

A typical workflow looks like this (with synthetic data standing in for X and y):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 200 samples, 3 features
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.1, 200)

# 1. Split your data FIRST (to prevent data leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Create the scaler
scaler = StandardScaler()

# 3. Fit the scaler ONLY on X_train (it learns the training mean and std dev)
X_train_scaled = scaler.fit_transform(X_train)

# 4. Use that same fitted scaler to transform X_test
X_test_scaled = scaler.transform(X_test)

# 5. Now, train your model on X_train_scaled
model = LinearRegression().fit(X_train_scaled, y_train)
print(model.coef_)                         # comparable, since features share a scale
print(model.score(X_test_scaled, y_test))  # R² on the held-out test set
```