Decision Trees in Machine Learning
Learn to build intuitive classification models, visualize them, and prevent overfitting.
Topic 1: What is a Decision Tree?
A Decision Tree is one of the most intuitive and powerful models in machine learning. It's a supervised learning algorithm used for both classification (Is this a Cat, Dog, or Bird?) and regression (What will the price be?).
It works by asking a series of simple "yes/no" questions to split the data, just like a flowchart or a game of "20 Questions."

Real-World Analogy: Classifying a Fruit
Imagine you want to build a model to identify a fruit. The tree would learn a set of rules:
- Root Node: "Is the color Red?"
  - Yes: "Is the diameter < 2 inches?"
    - Yes: It's a Cherry (Leaf Node)
    - No: It's an Apple (Leaf Node)
  - No: "Is the color Yellow?"
    - Yes: It's a Banana (Leaf Node)
    - No: It's a Grape (Leaf Node)
The tree is a "white box" model, meaning we can see and understand every single rule it learned. This is its biggest advantage.
Key Parts of a Tree
- Root Node: The very first question that splits all the data (e.g., "Is the color Red?").
- Decision Node: Any node that asks a question and splits the data further.
- Leaf Node: The final answer. A node that makes a final prediction (e.g., "Apple," "Cherry").
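To make this concrete, here is a minimal sketch of the fruit example in scikit-learn. The tiny dataset and the numeric feature encoding (is_red, is_yellow, diameter) are invented purely for illustration; the point is that export_text prints back every rule the tree learned, which is exactly the "white box" property described above.

# A toy version of the fruit tree (data and encoding are made up for illustration)
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [is_red (0/1), is_yellow (0/1), diameter in inches]
X = [
    [1, 0, 0.8],   # cherry
    [1, 0, 3.0],   # apple
    [0, 1, 7.0],   # banana
    [0, 0, 0.7],   # grape
]
y = ["cherry", "apple", "banana", "grape"]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X, y)

# Print every rule the tree learned, flowchart-style
print(export_text(tree, feature_names=["is_red", "is_yellow", "diameter"]))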
Topic 2: How Does a Tree "Learn"?
How does the tree know which question to ask first? Why "Is the color Red?" and not "Is the diameter < 3 inches?"
The tree "learns" by finding the question that creates the purest possible split. It wants to find a split that results in new groups that are as "un-mixed" as possible.
Analogy: Sorting Marbles
Imagine a jar (a node) with 50 red and 50 blue marbles (very "impure"). A good split would be one that results in two new jars: one with 40 red / 10 blue, and the other with 10 red / 40 blue. The new jars are "purer" than the original. A perfect split would create a jar of 50 red and a jar of 50 blue (Gini = 0).
We measure this "purity" using two main methods:
1. Gini Impurity
Gini Impurity is the probability of incorrectly labeling a randomly chosen data point if you labeled it based on the distribution of labels in the node.
- A Pure Node (100% Setosa) has a Gini Impurity of 0.0. (0% chance of being wrong).
- An Impure Node (50% Setosa, 50% Versicolor) has a Gini Impurity of 0.5. (50% chance of being wrong).
The tree calculates the Gini Impurity for every possible split and chooses the one that results in the biggest decrease in impurity. This is called Information Gain.
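To make the numbers concrete, here is a small sketch in plain Python that applies the Gini formula (1 minus the sum of squared class proportions) to the marble example above and measures how much the split reduces impurity:

# Gini impurity for the marble jars (plain Python, no libraries needed)
def gini(counts):
    # 1 - sum of squared class proportions
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([50, 50])   # 0.5  -> maximally mixed jar
left = gini([40, 10])     # 0.32 -> purer
right = gini([10, 40])    # 0.32 -> purer

# Weight each child jar by its share of the marbles (50 of 100 each),
# then compare with the parent; the drop is the gain from this split.
weighted_children = 0.5 * left + 0.5 * right   # 0.32
gain = parent - weighted_children              # 0.18
print(parent, weighted_children, gain)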
2. Entropy
Entropy is a measure of "disorder" or "surprise."
- A Pure Node (100% Setosa) has 0.0 Entropy (no disorder, no surprise).
- An Impure Node (50% Setosa, 50% Versicolor) has 1.0 Entropy (maximum disorder).
By default, sklearn uses criterion='gini', but you can switch it to criterion='entropy'. They usually produce very similar results.
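For comparison, here is the same kind of sketch for entropy, plus the one-line change that switches the criterion in scikit-learn (the class counts below are just illustrative):

# Entropy for a node, and switching the splitting criterion in sklearn
from math import log2
from sklearn.tree import DecisionTreeClassifier

def entropy(counts):
    # -sum(p * log2(p)) over the class proportions in the node
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([50, 50]))   # 1.0  -> maximum disorder
print(entropy([45, 5]))    # ~0.47 -> much purer (a 100% pure node would be 0.0)

gini_tree = DecisionTreeClassifier(criterion="gini")        # the default
entropy_tree = DecisionTreeClassifier(criterion="entropy")  # the alternative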
Topic 3: The Overfitting Problem (And Pruning)
This is the single biggest weakness of Decision Trees. If you don't control them, they will "overfit" the data.
Overfitting means the model "memorizes" the training data instead of "learning" the general patterns. It will create a separate path for every single data point, resulting in a giant, complex tree that is 100% accurate on the training data. But when it sees new, unseen data, it fails miserably.
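Here is a minimal sketch of what this looks like in practice, using the Iris data (the dataset choice and variable names are assumptions for illustration): train an unrestricted tree and compare its score on the data it memorized against its score on held-out data.

# An unrestricted tree: perfect on training data, weaker on unseen data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

deep_tree = DecisionTreeClassifier(random_state=42)   # no limits on growth
deep_tree.fit(X_train, y_train)

print("train accuracy:", deep_tree.score(X_train, y_train))  # typically 1.0
print("test accuracy: ", deep_tree.score(X_test, y_test))
# A large gap between the two scores is the classic sign of overfitting.
# (Iris is an easy dataset, so the gap here is small; it grows on noisy data.)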
The Solution: Pruning (Stopping the Growth)
We "prune" the tree by setting limits on its growth before we train it. These are the most important parameters to tune (a short code sketch follows the list):
- max_depth (e.g., max_depth=3): This is the most common method. It stops the tree from asking more than 3 "levels" of questions, which forces it to learn only the most important, general patterns.
- min_samples_leaf (e.g., min_samples_leaf=5): This tells the tree: "Don't make a split if any of the resulting leaf nodes would have fewer than 5 samples." This prevents the tree from creating tiny, specific nodes for individual outliers.
- min_samples_split (e.g., min_samples_split=20): This tells the tree: "Don't even try to split a node unless it has at least 20 samples in it."
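Here is a minimal sketch of how these three limits are passed to scikit-learn's DecisionTreeClassifier; the specific values are just the examples from the list above, not recommendations for any particular dataset:

# The three pruning limits from the list above, set at construction time
from sklearn.tree import DecisionTreeClassifier

pruned_tree = DecisionTreeClassifier(
    max_depth=3,           # at most 3 "levels" of questions
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    min_samples_split=20,  # only consider splitting nodes with 20+ samples
    random_state=42,
)
# pruned_tree.fit(X_train, y_train)  # then train exactly as before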
Topic 4: Feature Importance
One of the best features of a Decision Tree is that it's a "white-box" model. We can easily ask it how it made its decisions.
The tree automatically calculates which features were the most "useful" for splitting the data (i.e., which features provided the most "Information Gain"). A feature that is used at the top of the tree (like "petal length" in the Iris dataset) is far more important than a feature that is never used.
We can access this information easily after training:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the Iris data and train a pruned tree (full dataset used here for brevity)
iris = load_iris()
X_train, y_train = iris.data, iris.target
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Get the importance scores (one value per feature, summing to 1.0)
importances = model.feature_importances_
# This gives an array, e.g., [0.01, 0.0, 0.95, 0.04]
# You can then plot this to see which features mattered most
sns.barplot(x=importances, y=iris.feature_names)
plt.show()