DSPython

Handling Outliers

Identify and manage extreme values that can skew your data analysis.

Data Science Intermediate 60 min

📉 Introduction: The "Black Sheep" of Data

An **Outlier** is a data point that lies extremely far from the rest of the data. Outliers are the "black sheep" of your dataset.

Outliers create a huge problem because they **heavily skew** statistical averages. For example, if the average salary is $50,000, but the CEO's $10 million salary is included, the mean will shoot up, giving a false impression of the average worker's pay.

Outliers can be a genuine finding (like a sudden spike in sales) or a simple error (like an age of 150 years). We must identify them before feeding the data into a machine learning model.

📊 Topic 1: Visualizing Outliers (Boxplots)

The best visual tool for identifying outliers is the **Boxplot** (or box-and-whisker plot). It visually represents the distribution and highlights extreme values.

[Image of a boxplot chart showing quartiles and outlier points]

🎯 Boxplot Components:

  • **Median (Q2):** The line in the middle of the box (50th percentile).
  • **The Box:** Represents the middle 50% of the data (from Q1 to Q3).
  • **Whiskers:** The lines extending from the box, representing the range of the non-outlier data.
  • **Outliers:** Individual points plotted outside the whiskers (potential errors).
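As a minimal sketch, the components above can be seen by drawing a boxplot with matplotlib (the score list here is a hypothetical example; matplotlib applies the same 1.5 × IQR whisker rule described in the next topic):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical scores with one obvious extreme value (100)
scores = [10, 20, 22, 24, 25, 26, 28, 30, 100]

fig, ax = plt.subplots()
parts = ax.boxplot(scores)  # draws the box (Q1-Q3), median line, whiskers, and outlier points
ax.set_ylabel("score")

# The points matplotlib plotted beyond the whiskers are the outliers ("fliers")
flagged = parts["fliers"][0].get_ydata()
print(sorted(flagged))
```

Here both 10 and 100 fall outside the whiskers, so they appear as individual outlier points.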

🔢 Topic 2: Identifying Outliers with IQR

The **Interquartile Range (IQR)** is the most robust statistical method for finding outliers, as it is unaffected by the outliers themselves.

📝 IQR Calculation (The 1.5 Rule):

  • **1. Calculate Quartiles:** $Q1$ = 25th percentile, $Q3$ = 75th percentile.
  • **2. Calculate IQR:** $IQR = Q3 - Q1$
  • **3. Define Boundaries:**
      • Lower Bound (Min Whisker) $= Q1 - (1.5 \times IQR)$
      • Upper Bound (Max Whisker) $= Q3 + (1.5 \times IQR)$

💻 Example: Calculating Boundaries

# Example data with one obvious extreme value (100)
import pandas as pd

df = pd.DataFrame({'score': [10, 20, 22, 24, 25, 26, 28, 30, 100]})

Q1 = df['score'].quantile(0.25)  # Result: 22.0
Q3 = df['score'].quantile(0.75)  # Result: 28.0
IQR = Q3 - Q1                    # Result: 6.0

upper_bound = Q3 + (1.5 * IQR)   # 37.0
lower_bound = Q1 - (1.5 * IQR)   # 13.0

# Filter for outliers (values outside 13.0 to 37.0)
outliers = df[(df['score'] > upper_bound) | (df['score'] < lower_bound)]
# Result: Rows containing 10 and 100 are flagged as outliers.

🔭 Topic 3: Z-Score Method

The **Z-Score** method is ideal when your data is **normally distributed** (symmetrical bell curve). It measures how far a point is from the mean in terms of standard deviations.

📝 Z-Score Rule:

Any data point with a Z-Score greater than **3** or less than **-3** is considered an extreme outlier (it's outside 99.7% of the data).

💻 Example: Calculating Z-Scores

# Calculate Z-Scores using Pandas/NumPy
import numpy as np
import pandas as pd

mean = df['score'].mean()
std = df['score'].std()

# The Z-Score formula: (x - mean) / std; abs() treats both tails the same
z_scores = np.abs((df['score'] - mean) / std)

# Filter for outliers (|Z-Score| > 3)
outliers_z = df[z_scores > 3]

⚠️ Limitation:

The Z-Score is highly sensitive to the mean and standard deviation. If the dataset itself is already heavily skewed by outliers, the Z-Score method may fail to identify the biggest ones correctly.
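This masking effect is easy to demonstrate with the same hypothetical score list used in the IQR example: the value 100 inflates the mean and standard deviation so much that its own Z-Score never reaches 3.

```python
import numpy as np

# Same small sample as in the IQR example; 100 drags both the mean and std upward
scores = np.array([10, 20, 22, 24, 25, 26, 28, 30, 100])

z = np.abs((scores - scores.mean()) / scores.std())

# Even the extreme value 100 scores below 3, so the |Z| > 3 rule flags nothing
print(z.max())
print(scores[z > 3])
```

The IQR method, by contrast, flags both 10 and 100 on this same data, which is why it is preferred for skewed samples.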

🔨 Topic 4: Techniques to Treat Outliers

Once outliers are identified, you must decide how to handle them. The right technique depends on the reason for the outlier.

1. Removal (Deletion):

**Best for:** Clear data entry errors (e.g., age = 150). **Method:** Use a boolean filter (df[~outliers_filter]) or simply drop the row index.
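A minimal sketch of removal with a boolean filter, using a small hypothetical frame (the names and the 120-year cutoff are illustrative assumptions):

```python
import pandas as pd

# Hypothetical records with an impossible age from a data-entry error
df = pd.DataFrame({"name": ["Ana", "Ben", "Cara"], "age": [34, 150, 29]})

# Flag implausible ages, then keep the negation (~) of the outlier mask
mask = df["age"] > 120  # illustrative threshold
clean = df[~mask].reset_index(drop=True)
print(clean)
```

The `~` operator inverts the boolean mask, so `df[~mask]` keeps every row that is *not* flagged.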

2. Capping (Winsorizing):

**Best for:** Real-world extreme values (e.g., CEO salary). **Method:** Replace the outlier value with the Upper Boundary (Q3 + 1.5*IQR). This limits its influence but keeps the data point.

# Example Capping
df['score'] = np.where(df['score'] > upper_bound, upper_bound, df['score'])

3. Transformation (Log Scale):

**Best for:** Heavily skewed data (e.g., population/income). **Method:** Apply np.log() to the column. This compresses the wide range of values, bringing outliers closer to the distribution center.
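A short sketch of the compression effect, using hypothetical income values (`np.log1p`, i.e. log(1 + x), is used instead of plain `np.log` so the code also works if a value is 0):

```python
import numpy as np
import pandas as pd

# Skewed, income-like values (hypothetical)
df = pd.DataFrame({"income": [30_000, 45_000, 52_000, 61_000, 10_000_000]})

# Apply a log transformation to compress the range
df["log_income"] = np.log1p(df["income"])

# Before: the max is hundreds of times the median; after: less than twice it
print(df["income"].max() / df["income"].median())
print(df["log_income"].max() / df["log_income"].median())
```

The extreme value is still present after the transformation; it is simply much closer to the rest of the distribution.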

📚 Module Summary

  • **Outlier:** Data point far from the majority, heavily skews the Mean.
  • **Visualize:** Use Boxplots to see outliers easily.
  • **Method 1 (IQR):** Calculates boundaries at $Q1 - (1.5 \times IQR)$ and $Q3 + (1.5 \times IQR)$. Best for non-normal data.
  • **Method 2 (Z-Score):** Measures distance from Mean in terms of STD. Best for normal data.
  • **Treatment:** Capping (Winsorizing) is generally safer than simple Removal.

🤔 Interview Q&A


**Q1: Which statistic, the Mean or the Median, is least affected by outliers?**

The **Median** (50th percentile) is least affected by outliers, as it only considers the middle position of the data. The Mean, in contrast, is heavily affected because it sums all values, including the extreme ones.

**Q2: When should you use the Z-Score method instead of the IQR method?**

You should use the **Z-Score** method primarily when you are confident that your data follows a **Normal (Bell Curve) Distribution**. The **IQR** method is preferred when the data is skewed or non-normal, as it relies on percentiles, not the mean/standard deviation.

**Q3: What is Capping (Winsorizing), and why is it used?**

Capping is the process of replacing an outlier value with a specific boundary value (e.g., replacing any score above 37.0 with 37.0). It is used to reduce the skewing influence of the outlier while preventing the loss of the entire data record.

**Q4: Does a log transformation remove outliers?**

No, log transformation does not remove outliers, but it **reduces their magnitude**. By compressing the larger values, it makes the data distribution closer to normal, which often improves the performance of linear machine learning models.