
Dimensionality Reduction

Learn to visualize high-dimensional data and combat the "Curse of Dimensionality" with PCA and t-SNE.

Machine Learning · Intermediate · 60 min

Topic 1: What is Dimensionality Reduction?

Dimensionality Reduction is the process of reducing the number of features (or "dimensions") in a dataset while trying to preserve as much of the important information as possible.

Imagine you have a dataset with 500 features (columns). A human can't "see" the relationships in 500-dimensional space, and models struggle in that space too. This problem is known as the "Curse of Dimensionality."

The Curse of Dimensionality:

As the number of features (dimensions) increases, the amount of data needed to fill that space grows exponentially. Your data becomes "sparse," like a few lonely stars in a vast, empty universe.

  • Models Overfit: Models start "memorizing" noise instead of learning patterns.
  • Calculations Slow Down: More features = more calculations.
  • Visualization is Impossible: We can't plot or understand 500-D space.
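
A quick way to feel this sparsity is to watch distances lose their meaning as dimensions grow. The sketch below is a minimal illustration (the random data and dimension counts are arbitrary assumptions, not part of the lesson): for a cloud of random points, the nearest and farthest neighbors of a point end up almost the same distance away once the space is high-dimensional.

import numpy as np

rng = np.random.default_rng(42)

# Measure how "distinct" distances stay as the number of dimensions grows.
for n_dims in [2, 10, 100, 1000]:
    points = rng.random((500, n_dims))                       # 500 random points in a unit hypercube
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   # distances from the first point
    ratio = dists.min() / dists.max()                        # nearest vs. farthest neighbor
    print(f"{n_dims:>4} dims: nearest/farthest distance ratio = {ratio:.2f}")

# The ratio climbs toward 1.0 as dimensions increase: every point looks
# roughly equally far away, which is the sparsity the "curse" describes.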

Why Reduce Dimensions?

  • 1. Visualization: To "squash" data from 100D down to 2D or 3D so we can plot it on a scatter chart and see clusters or patterns.
  • 2. Speed: Fewer features means models (like Decision Trees or SVMs) train *much* faster.
  • 3. Performance: It can remove "noise" features that are confusing the model, sometimes leading to *better* accuracy.

Topic 2: Principal Component Analysis (PCA)

PCA is the most popular technique for dimensionality reduction. It is a linear technique, meaning it finds new features that are simple, straight-line combinations of the old ones.

PCA's goal is to find the "axes" in the data that capture the most variance (the most spread or information).

Analogy: Squashing a 3D Object

Imagine a 3D model of a person (3 dimensions: X, Y, Z). You want to create a 2D shadow (2 dimensions). To capture the *most information*, you wouldn't shine the light from the top (you'd just see a small circle). You'd shine it from the front, capturing the full "silhouette."

PCA does this mathematically. It finds the best "angle" (the **Principal Component 1**) to project the data onto that captures the most variance (the "silhouette"). Then it finds the next-best angle (Principal Component 2), and so on.
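
To make the "best angle" idea concrete, here is a minimal sketch on made-up 2D data (the toy dataset is an assumption, not part of the lesson): the point cloud is stretched along a diagonal, and PCA's first component lines up with that diagonal because that is where the variance lives.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy data: two features that are almost identical, so the cloud
# is stretched along the diagonal y = x.
spread = rng.normal(0, 3, 500)       # big spread along the diagonal
noise = rng.normal(0, 0.5, 500)      # small spread across it
X_toy = np.column_stack([spread, spread + noise])

pca = PCA(n_components=2).fit(X_toy)

# PC1 points along the diagonal (roughly [0.71, 0.71], up to sign)
# and captures almost all of the variance.
print("PC1 direction:", pca.components_[0])
print("Explained variance ratio:", pca.explained_variance_ratio_)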

Key PCA Concepts:

  • Principal Component 1 (PC1): The "new" feature (axis) that captures the single largest amount of variance in the data.
  • Principal Component 2 (PC2): The *next* axis, which must be orthogonal (at a 90-degree angle) to PC1, that captures the most *remaining* variance.
  • Important: PCA is highly sensitive to the scale of your data. You must scale your data (e.g., with StandardScaler) *before* applying PCA.
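
Putting these pieces together, here is a minimal scale-then-PCA sketch using scikit-learn's built-in Iris dataset (the choice of dataset and n_components=2 are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)        # 150 samples, 4 features

# 1. Scale first -- PCA is sensitive to feature scale.
X_scaled = StandardScaler().fit_transform(X)

# 2. Project the 4 scaled features onto 2 principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                       # (150, 2) -- ready for a 2-D scatter plot

A fitted PCA object like this one is what the next topic inspects with explained_variance_ratio_.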

Topic 3: Explained Variance

After you run PCA, how do you know if your 2D shadow is any good? You check the "Explained Variance Ratio."

This tells you what percentage of the *total* information (variance) from the original data is captured by each of your new components.

# After fitting PCA:
print(pca.explained_variance_ratio_)

# Output might be: [0.729 0.228]

This output means:

  • PC1 (the first component) captured 72.9% of all the original variance.
  • PC2 (the second component) captured 22.8% of all the original variance.

Together, our two new features have captured 72.9% + 22.8% = 95.7% of the total information from the original 4 features (these figures come from running PCA on the scaled, 4-feature Iris dataset). That's an excellent trade-off! We cut the number of features in half but lost only 4.3% of the information.

The Scree Plot

A "Scree Plot" is a simple line chart of the *cumulative* explained variance. It helps you decide how many components to keep. You look for the "elbow" — the point where adding another component gives you diminishing returns (doesn't add much new information).


Topic 4: t-SNE (t-distributed Stochastic Neighbor Embedding)

While PCA is a powerful *transformation* tool, **t-SNE** is a powerful *visualization* tool. It is non-linear and has one primary goal: to find a 2D or 3D representation of your data that preserves the "neighborhoods."

In simple terms, t-SNE tries to keep points that were "close" in 500-D space "close" in 2D space, and points that were "far apart" in 500-D space "far apart" in 2D space.

PCA vs. t-SNE: The Key Difference

  • Use PCA... as a pre-processing step to reduce features *before* feeding them into a model. It's a true mathematical transformation.
  • Use t-SNE... *only* for visualization. It is fantastic at revealing complex clusters and non-linear structures that PCA would miss.
  • WARNING: Do not use t-SNE to transform data before modeling. It is slow, and the "distances" between clusters in a t-SNE plot are not mathematically meaningful in the same way as PCA.

from sklearn.manifold import TSNE

# t-SNE is computationally expensive, so it's common
# to first reduce data with PCA (to e.g., 50 dimensions)
# and then use t-SNE to go from 50 down to 2.

# For Iris, we can just run it directly on the scaled data.
# perplexity roughly controls how many neighbors each point
# "pays attention to" (values of 5-50 are typical).
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# Now X_tsne is ready to be plotted.
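
To actually see the clusters, here is a minimal plotting sketch (assuming matplotlib is available and y holds the Iris class labels loaded in the earlier example):

import matplotlib.pyplot as plt

# Color each 2-D point by its class label so the clusters stand out.
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="viridis", s=30)
plt.xlabel("t-SNE dimension 1")
plt.ylabel("t-SNE dimension 2")
plt.title("t-SNE view of the Iris dataset")
plt.show()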