
K-Nearest Neighbors (KNN)

Learn to build a simple, effective classifier based on "neighborly" distances.

Machine Learning · Intermediate · 60 min

Topic 1: What is K-Nearest Neighbors?

K-Nearest Neighbors (KNN) is one of the simplest and most intuitive machine learning algorithms. It's a "lazy learner," which means it doesn't "learn" a complex model during training. Instead, it just memorizes the entire training dataset.

When it's time to predict, KNN looks at the new, unknown data point and finds its "K" closest neighbors from the training set. The prediction is then made by a simple vote (for classification) or average (for regression).


Analogy: Moving to a New Neighborhood

Imagine you want to predict the political party of a new person in town. You don't know anything about them except their address.

  1. You decide to check their K=5 closest neighbors.
  2. You find that 3 of the neighbors are "Party A" and 2 are "Party B".
  3. By a majority vote, you predict the new person is most likely "Party A".

That's it. That is how KNN classification works. If you were predicting their income (regression), you would just average the incomes of the 5 closest neighbors.
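Here is a minimal sketch of that analogy in scikit-learn. All of the "addresses," party labels, and incomes below are made up purely for illustration:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical (x, y) "addresses" of six existing residents
addresses = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 5]]
parties = ["A", "A", "A", "B", "B", "B"]                # known party of each resident
incomes = [52000, 48000, 61000, 58000, 75000, 80000]    # known income of each resident

newcomer = [[0.5, 0.5]]

# Classification: majority vote among the K=5 nearest neighbors
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(addresses, parties)
print(clf.predict(newcomer))   # 3 of the 5 nearest are "A", so the vote says "A"

# Regression: average of the K=5 nearest neighbors' incomes
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(addresses, incomes)
print(reg.predict(newcomer))   # mean income of those same 5 neighbors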


Topic 2: The "K" in KNN (Bias vs. Variance)

The "K" (the number of neighbors to check) is the most important parameter you must choose. It controls the Bias-Variance Trade-off.

Small K (e.g., K=1)

  • High Variance, Low Bias: The model is very flexible and sensitive to noise. If the single closest neighbor is an outlier or a mislabeled point, your prediction will be wrong. This is overfitting.
  • The decision boundary will be very jagged and complex.

Large K (e.g., K=100)

  • Low Variance, High Bias: The model is very smooth and stable, but it might ignore local patterns. If you check 100 neighbors, you're basically just predicting the majority class of the whole dataset. This is underfitting (see the sketch after this list).
  • The decision boundary will be very smooth.
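A quick sketch makes both failure modes visible. The iris dataset here is just a stand-in for your own data, and the exact scores will vary:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k}: train={knn.score(X_train, y_train):.2f}, test={knn.score(X_test, y_test):.2f}")

# K=1 scores a perfect 1.00 on the training data (every point is its own nearest
# neighbor); on noisier datasets, that gap between train and test accuracy is the
# signature of overfitting. K=100 votes over most of the training set, washing
# out local structure -- underfitting.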

How to find the best K?

You find the best K by testing many different values and seeing which one gives the highest accuracy. Ideally, measure that accuracy with cross-validation or a held-out validation set rather than your final test data, so the test set stays untouched for the final evaluation. A common method is to plot K vs. Accuracy and pick the K at the highest (and most stable) point of the curve.
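A minimal sketch of that search, using 5-fold cross-validation so the test set stays untouched (iris again stands in for your own, already-scaled data):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset for illustration

# Try a range of odd K values (odd K avoids ties in binary voting) and record
# the mean cross-validated accuracy for each one
scores = {}
for k in range(1, 32, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"Best K = {best_k} with mean CV accuracy {scores[best_k]:.3f}")

# Plotting scores.keys() against scores.values() gives the K-vs-accuracy curve.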


Topic 3: The Critical Step: Feature Scaling

KNN is based entirely on distance (usually Euclidean distance). This means it is extremely sensitive to the scale of your features: a feature with a large numeric range will dominate the distance calculation. If you do not scale your data, KNN will usually perform badly.

Analogy: Age vs. Salary

Imagine you have a dataset with two features: Age (from 20 to 60) and Salary (from 50,000 to 100,000).

You want to find the distance between two people:

  • Person A: Age 30, Salary 60,000
  • Person B: Age 35, Salary 62,000

The distance in "Age" is (35 - 30) = 5.
The distance in "Salary" is (62,000 - 60,000) = 2,000.

When the algorithm calculates the total distance, the 'Salary' feature's distance (2,000) will completely dominate the 'Age' feature's distance (5). The model will only pay attention to salary and will completely ignore age. This is bad.
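Plugging those numbers into the Euclidean distance formula shows the problem directly (a quick NumPy sketch):

import numpy as np

person_a = np.array([30, 60_000])   # [age, salary]
person_b = np.array([35, 62_000])

# Euclidean distance on the raw, unscaled features
raw_distance = np.linalg.norm(person_a - person_b)
print(raw_distance)   # ~2000.006 -- the age difference of 5 barely moves the total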

By using StandardScaler, you put both features on the same scale (mean 0, standard deviation 1), so they contribute comparably to the distance.

from sklearn.datasets import load_iris              # example dataset -- a stand-in for your own data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# 0. Example data and split (replace with your own X and y)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 1. Create the scaler
scaler = StandardScaler()

# 2. Fit the scaler ONLY on the training data
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the test data (do NOT fit again)
X_test_scaled = scaler.transform(X_test)

# 4. Now, train your model on the SCALED data
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_scaled, y_train)

# 5. Make predictions on the SCALED test data
preds = model.predict(X_test_scaled)

Topic 4: Pros and Cons of KNN

Pros

  • Simple & Intuitive: Easy to understand and explain to non-technical stakeholders.
  • No "Training" Time: It's a "lazy learner." The fit method just loads the data into memory, which is instant.
  • Flexible: Can be easily used for classification, regression, and finding similar items (like a recommendation engine).

Cons

  • Slow Prediction Time: For every single prediction, it must measure distances against the training data (every point, in the naive implementation; tree-based indexes help, but the cost still grows with the training set size). This is terrible for large datasets.
  • Curse of Dimensionality: Does not work well with many features (high dimensions). Distances become less meaningful in high-dimensional space (see the sketch after this list).
  • Requires Scaling: As explained, it's mandatory to scale your features.
  • Sensitive to Outliers: A few outliers can "pull" the decision boundary and lead to incorrect predictions.
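
The "curse of dimensionality" is easy to see with random data: as the number of features grows, the nearest neighbor ends up barely closer than the farthest one, so distance-based voting carries little information. A small sketch on synthetic points, purely for illustration:

import numpy as np

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))                           # 1,000 random points in the d-dimensional unit cube
    dists = np.linalg.norm(points[1:] - points[0], axis=1)   # distances from the first point to all others
    print(f"d={d}: nearest/farthest ratio = {dists.min() / dists.max():.2f}")

# As d grows, the ratio climbs toward 1: the "nearest" neighbor stops being
# meaningfully closer than the farthest one.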