Unsupervised Learning: Clustering

           

Learn how to group unlabeled data using K-Means, DBSCAN, and Hierarchical methods.

           
Python Basics · Beginner · 30 min

💡 Topic 1: Clustering (Unsupervised Learning)

**Clustering** is a core technique in **Unsupervised Learning**. This means the algorithm analyzes data **without any predefined labels** or correct answers (unlike regression or classification).

The algorithm's job is to automatically **find hidden structures or groups** in the data. It clusters data points that are "similar" to each other based on their features.

Real-World Analogy: Music Recommendations

A music streaming app groups users by similar listening habits (e.g., users who listen to fast, guitar-heavy songs vs. slow, orchestral songs). These groups are the **clusters** used for targeted recommendations.

Key Use Cases:

  • Customer Segmentation: Grouping customers by purchasing habits.
  • Anomaly Detection: Identifying data points that don't fit into any group (outliers).

🎯 Topic 2: K-Means Clustering (Centroid-Based)

**K-Means** is the most popular clustering algorithm. It aims to partition data into a **predefined number, K, of clusters**.

The algorithm works by iteratively finding the best position for **cluster centers (centroids)** through two steps: Assign (each point goes to the nearest center) and Update (the center moves to the average location of its assigned points).

⚠️ Critical Requirement: Data Scaling

K-Means measures distance. If features have vastly different scales (e.g., 'Age' 20-60 vs 'Salary' 50k-1M), the larger feature dominates. You must scale your data (e.g., with StandardScaler) before using K-Means.

💻 Example: K-Means Implementation

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Sample data for demonstration; any numeric feature matrix X works here
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Scale features so no single feature dominates the distance calculation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit K-Means with K=3 clusters
model = KMeans(n_clusters=3, random_state=42, n_init=10)
model.fit(X_scaled)
labels = model.labels_  # the cluster (0, 1, or 2) assigned to each point

🦴 Topic 3: The Elbow Method (Choosing K)

The major weakness of K-Means is that you have to choose **K** beforehand. The **Elbow Method** is a technique to visually determine the optimal number of clusters.

We measure **Inertia** (or WCSS - Within-Cluster Sum of Squares), which is the sum of squared distances from every point to its closest centroid.

We plot K vs. Inertia. The ideal K is the "elbow" of the curve: the point where the rate of decrease in inertia slows down significantly (the point of **diminishing returns**), as sketched below.
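
As an illustration, here is a minimal sketch of the Elbow Method, reusing the X_scaled array from the K-Means example above:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for K = 1..10 and record the inertia (WCSS) of each fit
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# The "elbow" is where this curve stops dropping sharply
plt.plot(k_values, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia (WCSS)")
plt.show()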

🌳 Topic 4: Hierarchical Clustering (Dendrograms)

This method does not require a predefined K. Instead, it builds a **tree structure** of clusters called a **Dendrogram**, showing the merging process of every data point.

Agglomerative (Bottom-Up):

The most common type starts with every data point as its own cluster and then progressively merges the two **closest** clusters until only one giant cluster remains (like building a family tree from the bottom up).

💻 Example: Implementing Agglomerative Clustering

from sklearn.cluster import AgglomerativeClustering

# Bottom-up (agglomerative) clustering; n_clusters=5 cuts the merge tree into 5 groups
agg_model = AgglomerativeClustering(n_clusters=5)
labels = agg_model.fit_predict(X_scaled)  # cluster label for each point
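
scikit-learn does not plot dendrograms itself, so a common approach is SciPy's hierarchy utilities. A minimal sketch, again assuming the X_scaled array from the K-Means example above:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the full merge tree with Ward linkage (also sklearn's default)
Z = linkage(X_scaled, method="ward")

# Each U-shaped link in the plot is one merge; its height is the merge distance
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()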

🌌 Topic 5: DBSCAN (Density-Based)

**DBSCAN** is powerful because it finds clusters of **arbitrary shapes** (moons, spirals), unlike K-Means, which assumes roughly spherical clusters. It groups points based on density.

Key Parameters:

  • eps: The radius of the neighborhood to check for density.
  • min_samples: The minimum number of points required within the eps radius to form a dense region (a cluster core).

DBSCAN automatically flags points that don't belong to any dense region as **Noise** (outliers), labeled as **-1**.
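
To make the parameters concrete, here is a minimal DBSCAN sketch, assuming the X_scaled array from earlier. The eps and min_samples values shown are scikit-learn's defaults and usually need tuning per dataset:

from sklearn.cluster import DBSCAN

# eps and min_samples are illustrative defaults; tune both for real data
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

# Noise points (outliers) receive the label -1
n_noise = (labels == -1).sum()
print(f"Clusters found: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {n_noise}")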

📚 Module Summary

  • K-Means: Fast, scalable. Requires predefined K. Assumes spherical clusters. Requires scaling.
  • Hierarchical: Creates a Dendrogram. No predefined K. Slow for large data.
  • DBSCAN: Finds arbitrary shapes. Automatically flags outliers (-1). Sensitive to density variations.

🤔 Interview Q&A


**Q: What is the major weakness of K-Means?**

The major weakness is that it requires the number of clusters (K) to be specified beforehand, and it performs poorly on clusters that are **non-spherical** or have **varying densities**.

**Q: Why must data be scaled before applying K-Means?**

K-Means measures distance (Euclidean distance). If features are on different scales (e.g., Age 20-60 vs Income 50k-1M), the larger feature will disproportionately influence the distance calculation, leading to inaccurate clusters.

**Q: How does DBSCAN handle outliers?**

DBSCAN automatically handles outliers by labeling them with **-1**. These points are considered "noise" because they do not belong to any dense region.

**Q: How does the Elbow Method help you choose K?**

You plot K vs. **Inertia** (WCSS). The optimal K is the point on the graph where the line suddenly stops bending sharply and starts to flatten out. This represents the point of **diminishing returns**.