Anomaly Detection in Machine Learning
Learn to identify outliers and anomalies using statistical and ML-based methods.
Topic 1: What is Anomaly Detection?
In simple terms, anomaly detection is the process of finding the "odd-one-out" in a set of data. These "odd" data points are called **anomalies** or **outliers**.
They are data points that deviate so much from the "normal" pattern that they seem suspicious. Identifying them is critical in many industries.
Real-World Analogy: Credit Card Fraud
Think about your credit card spending pattern:
- Your Normal Pattern: Groceries in Hyderabad (₹2,000), restaurants in Hyderabad (₹1,500), online shopping (₹3,000).
- The Anomaly: Suddenly, a transaction for ₹5,00,000 appears from a different country (e.g., London).
Your bank's system instantly flags this as an **anomaly** because it doesn't match your normal behavior. This is anomaly detection in action.
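To make the idea concrete, here is a deliberately simplistic sketch of that kind of check. The "10x your usual maximum" threshold is just an illustrative assumption, not how real fraud systems work:
# A toy rule-of-thumb version of what the bank's system does
usual_spends = [2000, 1500, 3000]        # your normal pattern (in ₹)
new_transaction = 500000                 # the ₹5,00,000 transaction from London
if new_transaction > 10 * max(usual_spends):  # hypothetical threshold
    print("Flagged as a possible anomaly")
Real systems use the statistical and ML methods covered in the next topics, but the underlying question is the same: does this point fit the established pattern?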
Where is it Used?
- Finance: Detecting fraudulent transactions and unusual stock market activity.
- IT & Cybersecurity: Spotting strange network traffic or login attempts that could signal a cyberattack.
- Manufacturing: Identifying vibrations or temperatures in a machine that are "abnormal," predicting a failure *before* it happens.
- Healthcare: Finding unusual patterns in an EKG or MRI scan that could indicate a disease.
Topic 2: The Z-Score Method (Statistical)
This is a classic statistical method. It works by measuring how "surprising" a data point is based on the average and normal range of the data.
Analogy: Class Heights
Imagine a class where the average height (the **Mean**) is 5'8". Most students are close to this, maybe between 5'5" and 5'11". This "normal range" of variation is measured by the **Standard Deviation (SD)**, which here is roughly 3 inches.
The Z-Score Logic:
A Z-Score is a simple number: "How many Standard Deviations (SDs) is this point away from the average?"
- A student who is 5'9" has a Z-Score of ~0.5 (very close to the average).
- A student who is 6'2" has a Z-Score of ~2.0 (taller than average, but still plausible).
- A student who is 7'1" has a Z-Score of ~5.0 (extremely far from the average).
The Rule: In a normal "bell curve," about 99.7% of all data falls within 3 standard deviations of the mean. Therefore, a common rule is that any data point with a Z-Score **greater than 3 or less than -3** is considered an anomaly.
from scipy import stats
import numpy as np
# Sample data: a tight cluster of small values plus one extreme outlier (100)
data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 100])
# 1. Calculate Z-Scores
# We use np.abs() because we care about distance (positive or negative)
z_scores = np.abs(stats.zscore(data))
# The "normal" values all score below ~0.4; the outlier 100 scores about 3.99
# 2. Find points where the Z-Score is greater than 3
anomalies = data[z_scores > 3]
print(anomalies)
# Output: [100]
Pros: Very simple, fast to calculate, and widely understood.
Cons: It only works well if your data follows a "Normal Distribution" (a bell curve). If your data is skewed (lopsided), the Z-Score can be misleading. It also needs a reasonable number of points: in a very small sample (fewer than about 10 values), a single extreme value inflates the standard deviation so much that no point can ever reach a Z-Score of 3.
Topic 3: The IQR Method (Interquartile Range)
This method is more "robust" than Z-Score, meaning it **works well even if the data is not a perfect bell curve**.
Analogy: Lining Up Students by Height
Instead of averages, this method uses percentiles (ranking).
- Line up 100 students by height.
- Q2 (Median): The 50th student. This is the *true* middle (50% are shorter, 50% are taller).
- Q1 (First Quartile): The 25th student. (25% are shorter).
- Q3 (Third Quartile): The 75th student. (75% are shorter).
The IQR Logic:
The "Interquartile Range" (IQR) is the distance between Q3 and Q1: IQR = Q3 - Q1. This represents the "middle 50%" of your data. This is your "normal" box.
We then build "fences" to define a "safe zone" outside this box. Anything beyond the fences is an outlier.
- Lower Fence: Q1 - (1.5 * IQR)
- Upper Fence: Q3 + (1.5 * IQR)
This is the exact logic used to draw a Boxplot. The dots you see outside the "whiskers" of a boxplot are the anomalies identified by this IQR method.
import pandas as pd
# Example data: a 'value' column of typical readings plus one obvious outlier (100)
df = pd.DataFrame({'value': [3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10, 10, 11, 12, 100]})
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)
# Find any values that fall outside this "safe zone"
anomalies = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]
print(anomalies)
# Output: the row containing 100
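If you would rather see these fences than compute them, a quick boxplot of the same column makes the outlier obvious. A minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt
# Points beyond 1.5 * IQR past the quartiles are drawn as individual dots
df['value'].plot(kind='box')
plt.show()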
Pros: Very effective. Not fooled by skewed data. Easy to understand and visualize with a boxplot.
Cons: Primarily designed for a single variable (univariate). Can be too basic for complex, multi-dimensional data.
Topic 4: Isolation Forest (ML-Based)
This is a modern, powerful Machine Learning algorithm. It has a completely different and clever strategy.
The Core Idea: Instead of profiling "normal" points, it actively tries to find the "lonely" ones. It works on the principle that **anomalies are easier to "isolate" (separate) than normal points.**
Analogy: Alien in a Crowd
Imagine a room with 100 people (a dense crowd) and 1 alien standing alone in a corner.
- To "isolate" one **normal person** from the crowd, you'd need to make *many* random cuts (dividing lines) to separate them from their neighbors.
- To "isolate" the **alien**, you'd probably only need *one or two* random cuts.
The Isolation Forest Logic:
The algorithm builds many random "decision trees" (this is the "Forest"). Each tree randomly "cuts" (splits) the data.
It then counts the average number of splits it takes to make each data point "isolated" (all by itself).
- Normal Points: Require *many* splits to be isolated.
- Anomalies: Require *very few* splits to be isolated.
from sklearn.ensemble import IsolationForest
# Reuse the same data as the IQR example: a one-column DataFrame
X = df[['value']]
# 'contamination' is your *guess* of what percent of the data is anomalous
# (e.g., 0.1 = 10%, 0.05 = 5%)
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X)
# predict() returns 1 for normal points and -1 for anomalies
preds = model.predict(X)
anomalies = X[preds == -1]
print(anomalies)
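The 1 / -1 labels are a hard cut-off. If you also want to rank points by how anomalous they look, the fitted model exposes a continuous score. A minimal sketch, reusing the same model and X from above:
# decision_function(): lower (more negative) scores mean "more anomalous"
scores = model.decision_function(X)
print(scores)
# The row containing 100 should get the lowest (most negative) score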
Pros: Extremely fast and effective, even on very large, high-dimensional datasets (many columns). Doesn't need data to be scaled.
Cons: You need to provide the `contamination` parameter (your "guess"), which you might not always know in advance.
Topic 5: Local Outlier Factor (LOF) (ML-Based)
This method is all about **context**. It asks, "Is this point an outlier *compared to its local neighbors*?"
Analogy: City vs. Village
Imagine you build a house that is 1 km away from any other building.
- Case 1 (City): You build this house in a dense city like Hyderabad, where all other houses are packed tightly together. You are a **local anomaly**. Your density is much lower than your neighbors.
- Case 2 (Village): You build this house in a rural village, where all houses are 1 km apart. You are **normal**. Your density matches your neighbors.
The LOF Logic:
LOF calculates a "local density score" for each data point. It then compares this score to the scores of its closest neighbors (e.g., its 20 nearest neighbors, set by `n_neighbors`).
If a point's density is **significantly lower than its neighbors' densities** (like the city example), it is flagged as an anomaly.
from sklearn.neighbors import LocalOutlierFactor
# Reuse the same one-column DataFrame X as in the Isolation Forest example
# 'n_neighbors' tells it how many neighbors to check for density
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
# fit_predict() returns 1 (normal) or -1 (anomaly)
preds = lof.fit_predict(X)
anomalies = X[preds == -1]
print(anomalies)
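LOF can also give you a score rather than just a label. After fit_predict(), the fitted object stores the negated outlier factor of each training point; a minimal sketch, reusing lof from above:
# negative_outlier_factor_: values close to -1 mean "normal";
# much more negative values mean "strong local outlier"
print(lof.negative_outlier_factor_)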
Pros: Excellent at finding anomalies in datasets that have different "clusters" of varying densities (where a global method like Z-Score would fail).
Cons: Can be computationally slow on very large datasets because it has to calculate distances between points.