Handling Missing Data
Learn to find, remove, and fill missing values (NaN) in your data.
🧀 Introduction: The "Swiss Cheese" Data
When you collect data from the real world, it rarely arrives in perfect shape. Just like Swiss cheese, your data table will have **holes** (missing values). These holes are a major problem for machine learning models.
Most statistical and ML algorithms (especially simpler ones like Linear Regression) **cannot process** a row that contains even a single missing value. Therefore, handling missing data is the very first critical step in cleaning.
In Pandas, missing data is usually represented by the special value **`NaN`** (Not a Number), which comes from the NumPy library.
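A quick sketch of why `NaN` needs special handling: it is a float, it propagates through arithmetic, and it is not even equal to itself, which is why Pandas ships dedicated detection methods.
import numpy as np

print(type(np.nan))       # <class 'float'>
print(np.nan + 1)         # nan (NaN propagates through arithmetic)
print(np.nan == np.nan)   # False (NaN is never equal to itself)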
🔍 Topic 1: Detecting Missing Values
You use two key methods in sequence: `.isnull()` (or its alias `.isna()`) to identify the holes, and `.sum()` to count them.
🎯 Detection Workflow:
- **Step 1:** `df.isnull()` returns a Boolean DataFrame of the same shape (`True` where a value is missing).
- **Step 2:** `.sum()` treats `True` as 1 and `False` as 0, giving you the total count per column.
💻 Example: Counting NaN
import pandas as pd
import numpy as np

data = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df = pd.DataFrame(data)
print("\n--- Missing Count ---")
print(df.isnull().sum())
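To inspect the affected rows themselves (rather than per-column counts), a common follow-up, sketched here with the same `df`:
# Keep only the rows that contain at least one NaN
rows_with_nan = df[df.isnull().any(axis=1)]
print(rows_with_nan)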
🔥 Topic 2: Strategy 1: Dropping Data (.dropna())
If you have a huge dataset (thousands of rows) and only a small percentage of rows have missing data, the fastest and safest solution is to simply remove those rows.
⚠️ Warning: Data Loss
Dropping data means permanently losing the remaining valid information in those rows/columns. Only use this when the data loss is minimal or when imputation is impossible.
💻 Example: Dropping Rows & Columns
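A minimal sketch using the `df` from Topic 1 (which rows or columns you drop will depend on your own data):
# Drop every row that contains at least one NaN (axis=0 is the default)
print(df.dropna())

# Drop every column that contains at least one NaN
print(df.dropna(axis=1))     # only column 'C' survives in this df

# Drop a row only if ALL of its values are NaN
print(df.dropna(how='all'))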
💧 Topic 3: Strategy 2: Filling Data (.fillna())
**Imputation** is the process of filling in the missing data with calculated values. This is generally preferred over dropping, as it preserves the structure of your dataset.
💡 Common Imputation Techniques:
- **Mean:** Fill with the average of the column (best for normally distributed data).
- **Median:** Fill with the middle value (best if there are outliers).
- **Mode:** Fill with the most frequent value (best for categorical/non-numeric data).
- **Zero/Constant:** Fill with a fixed value (e.g., setting a missing price to $0).
💻 Example: Filling with Mean/Median
df['A'] = df['A'].fillna(df['A'].mean())    # mean of [1, 2] is 1.5
df['B'] = df['B'].fillna(df['B'].median())  # median of [5] is 5.0
print(df)
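Mode and constant filling follow the same pattern; a short sketch with a made-up categorical column (the `City` data here is purely illustrative):
df_cat = pd.DataFrame({'City': ['Ankara', np.nan, 'Ankara', 'Izmir']})

# .mode() returns a Series (there can be ties), so take the first value
df_cat['City'] = df_cat['City'].fillna(df_cat['City'].mode()[0])
print(df_cat)

# Constant fill: replace a missing price with 0
prices = pd.Series([9.99, np.nan, 4.50]).fillna(0)
print(prices)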
➡️ Topic 4: Time Series Filling (ffill / bfill)
When dealing with time-series data (like stock prices or sensor readings), filling with the mean can be misleading, because each value depends on the values around it in time. Instead, we fill gaps from neighboring observations.
Forward Fill (ffill)
Fills the NaN hole with the value of the **last valid observation** before it (e.g., if the sensor reading is missing now, use the value from 5 minutes ago).
Backward Fill (bfill)
Fills the NaN hole with the value of the **next valid observation** after it (e.g., fill a missing price with the price that came 5 minutes later).
💻 Example: ffill / bfill
s = pd.Series([10, np.nan, 12, np.nan])

print("ffill:")
print(s.ffill())   # Output: [10, 10, 12, 12]
print("\nbfill:")
print(s.bfill())   # Output: [10, 12, 12, NaN] (the last one stays NaN: there is no later value)
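The same idea on time-stamped data, sketched with made-up sensor readings (the `temp` values and timestamps are purely illustrative):
idx = pd.date_range('2024-01-01 09:00', periods=4, freq='5min')
sensor = pd.DataFrame({'temp': [21.0, np.nan, np.nan, 22.5]}, index=idx)

# Carry the last known reading forward until a new one arrives
print(sensor.ffill())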
🔄 Topic 5: The .replace() Method
Pandas' `isnull()` only detects genuine missing values such as `np.nan` (and `None`). If your missing data is represented by placeholder strings like **'N/A'** or **'?'**, or by a sentinel number like **-99**, you must convert those markers to `np.nan` first.
💻 Example: Replacing Strings with `NaN`
df_dirty = pd.DataFrame({'Income': [50000, '?', 'N/A']})
df_dirty = df_dirty.replace(['?', 'N/A'], np.nan)
print(df_dirty.isnull().sum())  # Income: 2
📚 Module Summary
- **Detect:** Use `.isnull().sum()` to find the count of `NaN`s.
- **Drop:** Use `.dropna()` to remove rows/columns (use only if data loss is small).
- **Impute:** Use `.fillna()` to replace `NaN` with Mean, Median, or Mode.
- **Time Series:** Use `ffill` (last valid value) or `bfill` (next valid value).
- **Replace:** Use `.replace()` to convert custom markers ('?', 'N/A') to `np.nan`.
🤔 Interview Q&A
**Q: When is mean imputation a bad choice?**
Mean imputation is a bad choice when your data contains **outliers** or when the distribution is heavily **skewed**. In those cases the mean itself is misleading, so it is safer to impute with the **median**.

**Q: Why does Pandas use `np.nan` instead of Python's `None`?**
NumPy arrays (the base for Pandas) are designed for fast numerical processing. If Pandas used `None`, the column would be forced into the slower Python `object` dtype. By using `np.nan`, Pandas can keep the fast `float` dtype even when values are missing.

**Q: What is the risk of calling `dropna()` with the default `axis=0`?**
`axis=0` (the default) drops rows. If missing values are scattered across many rows, `dropna(axis=0)` can delete most of your dataset, severely reducing its size and statistical power.

**Q: When would you prefer `ffill` over mean imputation?**
You use **ffill (forward fill)** primarily with **time-series data** (e.g., stock prices, temperature). In that context, the value observed immediately before the gap is a better estimate than the long-term average (mean).