Unsupervised Learning: Clustering
Learn how to group unlabeled data using K-Means, DBSCAN, and Hierarchical methods.
💡 Topic 1: Clustering (Unsupervised Learning)
**Clustering** is a core technique in **Unsupervised Learning**. This means the algorithm analyzes data **without any predefined labels** or correct answers (unlike regression or classification).
The algorithm's job is to automatically **find hidden structures or groups** in the data. It clusters data points that are "similar" to each other based on their features.
Real-World Analogy: Music Recommendations
A music streaming app groups users by similar listening habits (e.g., users who listen to fast, guitar-heavy songs vs. slow, orchestral songs). These groups are the **clusters** used for targeted recommendations.
Key Use Cases:
- Customer Segmentation: Grouping customers by purchasing habits.
- Anomaly Detection: Identifying data points that don't fit into any group (outliers).
🎯 Topic 2: K-Means Clustering (Centroid-Based)
**K-Means** is the most popular clustering algorithm. It aims to partition data into a **predefined number, K, of clusters**.
The algorithm works by iteratively finding the best position for **cluster centers (centroids)** through two steps: Assign (each point goes to the nearest center) and Update (the center moves to the average location of its assigned points).
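To make the Assign/Update loop concrete, here is a minimal NumPy sketch of a single iteration (this assumes X is an array of shape (n_samples, n_features) and centroids holds the current K centers; in practice scikit-learn's KMeans runs this loop for you, as shown below):

```python
import numpy as np

# Assume X is an (n_samples, n_features) array and centroids is a
# (K, n_features) array holding the current cluster centers.

# Assign step: each point goes to its nearest centroid (Euclidean distance)
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)

# Update step: each centroid moves to the mean of the points assigned to it
centroids = np.array([X[labels == k].mean(axis=0) for k in range(len(centroids))])
```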
⚠️ Critical Requirement: Data Scaling
K-Means measures distance. If features have vastly different scales (e.g., 'Age' 20-60 vs 'Salary' 50k-1M), the larger feature dominates. You must scale your data (e.g., with StandardScaler) before using K-Means.
💻 Example: K-Means Implementation
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale the features so that no single feature dominates the distance metric
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Partition the scaled data into K=3 clusters
model = KMeans(n_clusters=3, random_state=42)
model.fit(X_scaled)
labels = model.labels_
```
🦴 Topic 3: The Elbow Method (Choosing K)
The major weakness of K-Means is that you have to choose **K** beforehand. The **Elbow Method** is a technique to visually determine the optimal number of clusters.
We measure **Inertia** (or WCSS - Within-Cluster Sum of Squares), which is the sum of squared distances from every point to its closest centroid.
We plot K vs. Inertia. The ideal K is the point where the rate of decrease in inertia slows down significantly (the point of **diminishing returns**), which looks like an "elbow" in the curve.
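As a sketch of how this looks in code (assuming the scaled data X_scaled from the K-Means example above and matplotlib for plotting):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for a range of K values and record the inertia (WCSS) of each
inertias = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

# Plot K vs. inertia and look for the "elbow" where the curve flattens
plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia (WCSS)")
plt.show()
```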
🌳 Topic 4: Hierarchical Clustering (Dendrograms)
This method does not require a predefined K. Instead, it builds a **tree structure** of clusters called a **Dendrogram**, showing the merging process of every data point.
Agglomerative (Bottom-Up):
The most common type starts with every data point as its own cluster and then progressively merges the two **closest** clusters until only one giant cluster remains (like building a family tree from the bottom up).
💻 Example: Implementing Agglomerative Clustering
```python
from sklearn.cluster import AgglomerativeClustering

# Merge the closest clusters bottom-up until 5 clusters remain
agg_model = AgglomerativeClustering(n_clusters=5)
labels = agg_model.fit_predict(X_scaled)
```
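scikit-learn's AgglomerativeClustering does not draw the tree itself; one common way to visualize the Dendrogram is with SciPy's hierarchy utilities. A sketch, assuming SciPy and matplotlib are available:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Build the full merge tree (Ward linkage minimizes within-cluster variance)
linkage_matrix = linkage(X_scaled, method="ward")

# Draw the dendrogram showing how points merge into larger clusters
dendrogram(linkage_matrix)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()
```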
🌌 Topic 5: DBSCAN (Density-Based)
**DBSCAN** is powerful because it finds clusters of **arbitrary shapes** (moons, spirals), unlike K-Means which is limited to spherical clusters. It groups points based on density.
Key Parameters:
- eps: The radius of the neighborhood to check for density.
- min_samples: The minimum number of points required within the eps radius to form a dense region (a cluster core).
DBSCAN automatically flags points that don't belong to any dense region as **Noise** (outliers), labeled as **-1**.
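A minimal usage sketch (reusing X_scaled from earlier; the eps and min_samples values below are illustrative, not tuned for any particular dataset):

```python
from sklearn.cluster import DBSCAN

# eps: neighborhood radius; min_samples: points needed to form a dense core
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)

# Points labeled -1 fell in no dense region and are treated as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```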
📚 Module Summary
- K-Means: Fast, scalable. Requires predefined K. Assumes spherical clusters. Requires scaling.
- Hierarchical: Creates a Dendrogram. No predefined K. Slow for large data.
- DBSCAN: Finds arbitrary shapes. Automatically flags outliers (-1). Sensitive to density variations.