Dimensionality Reduction
Learn to visualize high-dimensional data and combat the "Curse of Dimensionality" with PCA and t-SNE.
Topic 1: What is Dimensionality Reduction?
Dimensionality Reduction is the process of reducing the number of features (or "dimensions") in a dataset while trying to preserve as much of the important information as possible.
Imagine you have a dataset with 500 features (columns). It's impossible for a human to "see" the relationships in 500-dimensional space, and that visualization problem is only one symptom of the "Curse of Dimensionality."
The Curse of Dimensionality:
As the number of features (dimensions) increases, the amount of data needed to fill that space grows exponentially. Your data becomes "sparse," like a few lonely stars in a vast, empty universe (see the quick demo after this list).
- Models Overfit: Models start "memorizing" noise instead of learning patterns.
- Calculations Slow Down: More features = more calculations.
- Visualization is Impossible: We can't plot or understand 500-D space.
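To make that "sparseness" concrete, here is a quick illustration (an aside, not part of the lesson's core code) that counts how many cells of a 10-bins-per-feature grid actually contain any of 1,000 random points as the number of dimensions grows:
import numpy as np

rng = np.random.default_rng(0)
n_points, bins = 1_000, 10

for d in (1, 2, 3, 6):
    X = rng.random((n_points, d))                    # uniform points in a unit hypercube
    cells = set(map(tuple, (X * bins).astype(int)))  # grid cells that contain at least one point
    print(f"{d} dims: {len(cells):,} of {bins**d:,} cells occupied ({len(cells) / bins**d:.1%})")
In 1 or 2 dimensions almost every cell is occupied; by 6 dimensions at most 0.1% of the cells contain any data at all. That emptiness is exactly what the curse refers to.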
Why Reduce Dimensions?
- 1. Visualization: To "squash" data from 100D down to 2D or 3D so we can plot it on a scatter chart and see clusters or patterns.
- 2. Speed: Fewer features means models (like Decision Trees or SVMs) train *much* faster.
- 3. Performance: It can remove "noise" features that are confusing the model, sometimes leading to *better* accuracy.
Topic 2: Principal Component Analysis (PCA)
PCA is the most popular technique for dimensionality reduction. It is a linear technique, meaning it finds new features that are simple, straight-line combinations of the old ones.
PCA's goal is to find the "axes" in the data that capture the most variance (the most spread or information).
Analogy: Squashing a 3D Object
Imagine a 3D model of a person (3 dimensions: X, Y, Z). You want to create a 2D shadow (2 dimensions). To capture the *most information*, you wouldn't shine the light from the top (you'd just see a small circle). You'd shine it from the front, capturing the full "silhouette."
PCA does this mathematically. It finds the best "angle" (the **Principal Component 1**) to project the data onto that captures the most variance (the "silhouette"). Then it finds the next-best angle (Principal Component 2), and so on.
Key PCA Concepts:
- Principal Component 1 (PC1): The "new" feature (axis) that captures the single largest amount of variance in the data.
- Principal Component 2 (PC2): The *next* axis, which must be orthogonal (at a 90-degree angle) to PC1, that captures the most *remaining* variance.
- Important: PCA is highly sensitive to the scale of your data. You must scale your data (e.g., with StandardScaler) *before* applying PCA.
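As a minimal sketch of that scale-then-project workflow, here is PCA applied to the Iris dataset (the four-feature dataset used in the examples below); the names X, y, X_scaled, and pca defined here are reused in the later snippets:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the 4-feature Iris dataset
X, y = load_iris(return_X_y=True)

# 1. Scale first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# 2. Then project onto the top two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                              # (150, 2): 4 features reduced to 2
print(pca.components_[0] @ pca.components_[1])  # ~0.0: PC1 and PC2 are orthogonal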
Topic 3: Explained Variance
After you run PCA, how do you know if your 2D shadow is any good? You check the "Explained Variance Ratio."
This tells you what percentage of the *total* information (variance) from the original data is captured by each of your new components.
# After fitting PCA:
print(pca.explained_variance_ratio_)
# Output might be: [0.729 0.228]
This output means:
- PC1 (the first component) captured 72.9% of all the original variance.
- PC2 (the second component) captured 22.8% of all the original variance.
Together, our two new features have captured 72.9% + 22.8% = 95.7% of the total information from the original 4 features. That's an excellent trade-off! We reduced the features by 50% but only lost 4.3% of the information.
The Scree Plot
A "Scree Plot" is a simple line chart of the *cumulative* explained variance. It helps you decide how many components to keep. You look for the "elbow" — the point where adding another component gives you diminishing returns (doesn't add much new information).
Topic 4: t-SNE (t-distributed Stochastic Neighbor Embedding)
While PCA is a powerful *transformation* tool, **t-SNE** is a powerful *visualization* tool. It is non-linear and has one primary goal: to find a 2D or 3D representation of your data that preserves the "neighborhoods."
In simple terms, t-SNE tries to keep points that were "close" (neighbors) in 500-D space "close" in 2D space. It focuses on preserving these local neighborhoods; the large distances between far-apart groups are *not* preserved faithfully (more on that in the warning below).
PCA vs. t-SNE: The Key Difference
- Use PCA... as a pre-processing step to reduce features *before* feeding them into a model. It's a true mathematical transformation.
- Use t-SNE... *only* for visualization. It is fantastic at revealing complex clusters and non-linear structures that PCA would miss.
- WARNING: Do not use t-SNE to transform data before modeling. It is slow, and the "distances" between clusters in a t-SNE plot are not mathematically meaningful in the same way as PCA.
from sklearn.manifold import TSNE
# t-SNE is computationally expensive, so it's common
# to first reduce data with PCA (to e.g., 50 dimensions)
# and then use t-SNE to go from 50 down to 2.
# For Iris, we can just run it directly on the scaled data.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
# Now X_tsne is ready to be plotted.
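For example, a minimal plotting sketch, assuming matplotlib and the Iris labels y loaded in the earlier PCA snippet:
import matplotlib.pyplot as plt

# Colour each point by its Iris class to see whether the classes form clusters
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="viridis")
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.title("t-SNE projection of the scaled Iris data")
plt.show()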