Random Forest
Master one of the most popular and powerful ensemble models for classification and regression.
Topic 1: What is a Random Forest?
A **Random Forest** is an **ensemble** learning method that operates by constructing a multitude of **Decision Trees** at training time.

Think of a single Decision Tree as one "smart" expert. This expert might be very knowledgeable about the training data, but they can also be biased and "overfit," meaning they don't generalize well to new data.
A Random Forest is like a "committee of experts." It builds an entire "forest" of different, slightly-less-smart Decision Trees and then combines their predictions. For a classification task, the final prediction is the one that gets the most "votes" from the trees. For a regression task, it's the average of their predictions.
Analogy: The Wisdom of the Crowd
If you ask one "expert" to guess the number of jellybeans in a jar, they might be wildly wrong. If you ask 500 people and take the average of their guesses, the result is almost always *far* more accurate than any single guess.
A Random Forest applies this logic. It "averages out" the individual errors and biases of many trees to produce a single, powerful, and stable prediction.
Topic 2: How it Works (Part 1) - Bagging
Random Forests get their diversity from two main sources. The first is called **Bagging**, which stands for **B**ootstrap **Agg**regating.
- Bootstrap: This is the "B" part. Let's say you have 1000 rows of data. To build Tree 1, you don't use all 1000 rows. Instead, you create a new 1000-row dataset by randomly picking rows *with replacement*. This means some original rows might be picked 3 times, and others might not be picked at all (a short sketch of this sampling appears after this list).
- Aggregate: This is the "Agg" part. You repeat this "bootstrap" process 100 times, creating 100 unique (but overlapping) datasets. You then train 100 separate Decision Trees, one on each dataset.
- Vote: When you get a new data point to predict, you feed it to all 100 trees. The final answer is the majority vote (for classification) or the average (for regression).
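To make the bootstrap step concrete, here is a minimal sketch using NumPy. The array names and the 1000-row size are purely illustrative, matching the example above:

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000  # matches the "1000 rows of data" example above

# Draw 1000 row indices *with replacement* to form one bootstrap dataset.
bootstrap_indices = rng.integers(0, n_rows, size=n_rows)

# Count how often each original row was picked.
counts = np.bincount(bootstrap_indices, minlength=n_rows)
print("Rows never picked:   ", np.sum(counts == 0))
print("Rows picked 3+ times:", np.sum(counts >= 3))
```

On average, only about 63% of the original rows appear in any one bootstrap sample, and that row-level diversity is exactly what Bagging relies on.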
Topic 3: How it Works (Part 2) - Feature Randomness
This is the "Random" part of Random Forest, and it's a brilliant addition to Bagging. It's the second source of diversity for the trees.
If you only used Bagging, all your trees would still be pretty similar, because they would all probably pick the *same best feature* (e.g., "petal length") at the very first split. This makes them highly correlated, which reduces the benefit of averaging them.
The "Random" Trick:
When building each tree, at *every single split point*, the model is **not** allowed to consider all available features. Instead, it is only allowed to choose from a small, **random subset of features**.
For example, if you have 10 features, the model might be forced to pick the best split from only 3 random features (e.g., "Age", "Salary", "Location"). At the *next* split, it picks 3 *different* random features.
This forces the trees to be different. One tree might be an "expert" on Age, while another might be an "expert" on Salary. This diversity is what makes the final "vote" so robust and accurate.
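In `sklearn`, this behavior is controlled by the `max_features` parameter. The snippet below is only a sketch on a synthetic 10-feature dataset (the dataset and the specific numbers are illustrative, not from the lesson):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a dataset with 10 features.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,
    max_features=3,     # each split may only consider 3 randomly chosen features
    random_state=42
)
forest.fit(X, y)
```

Setting `max_features=None` would let every split see all 10 features, which pushes the model back toward plain Bagging with more highly correlated trees.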
Topic 4: Parameters & Advantages
Random Forests are powerful and don't require feature scaling, but they have a few key parameters to tune.
Key Parameters in `sklearn`
- `n_estimators`: The number of trees in the forest. More is generally better (e.g., 100, 500, 1000), but more trees take longer to train.
- `max_features`: The number of random features to consider at each split. A common default is `sqrt(n_features)`.
- `max_depth`: The maximum depth of each individual tree. Pruning the trees (e.g., `max_depth=5`) can reduce overfitting of the *entire* ensemble.
- `min_samples_leaf`: The minimum number of samples required to be at a leaf node. This also helps control overfitting.
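If you want to tune these parameters rather than guess, a cross-validated grid search is one common approach. This is only a sketch: the grid values are arbitrary, and it assumes `X_train` and `y_train` already exist (as in the example further below):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid over the parameters described above.
param_grid = {
    "n_estimators": [100, 500],
    "max_features": ["sqrt", 0.5],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,        # 5-fold cross-validation
    n_jobs=-1    # use all available CPU cores
)
search.fit(X_train, y_train)  # assumes X_train / y_train are already defined
print(search.best_params_)
```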
Advantages
- High Accuracy: One of the best "off-the-shelf" classifiers.
- Robust to Overfitting: Averaging many trees reduces variance, making the ensemble much less prone to overfitting than a single Decision Tree.
- No Scaling Needed: Like all tree models, it doesn't care if your features are on different scales.
- Feature Importance: It can tell you which features were most "important" in making its predictions.
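Here is the basic `sklearn` usage, putting several of these pieces together: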
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,   # 100 trees
    max_depth=10,       # each tree can only be 10 levels deep
    random_state=42     # for reproducible results
)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
```
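To read the importances, it helps to pair them with feature names. A minimal sketch, assuming `X_train` is a pandas DataFrame (if it is a NumPy array, you would supply the names yourself):

```python
import pandas as pd

# Sort features from most to least important (assumes X_train is a DataFrame).
importance_series = pd.Series(importances, index=X_train.columns)
print(importance_series.sort_values(ascending=False).head())
```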