Handling Missing Data
Learn to find, remove, and fill missing values (NaN) in your data.
🧀 Introduction: The "Swiss Cheese" Data
When you collect data from the real world, it rarely arrives in perfect shape. Just like Swiss cheese, your data table will have **holes** (missing values). These holes are a major problem for machine learning models.
Most statistical and ML algorithms (especially simpler ones like Linear Regression) **cannot process** a row that contains even a single missing value. Therefore, handling missing data is the very first critical step in cleaning.
In Pandas, missing data is usually represented by the special value **`NaN`** (Not a Number), which comes from the NumPy library.
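A quick sketch of why `NaN` needs special handling: it is a float, it propagates through arithmetic, and it is not even equal to itself, which is why Pandas ships dedicated detection methods.
import numpy as np

print(type(np.nan))       # <class 'float'>
print(np.nan + 1)         # nan (NaN propagates through arithmetic)
print(np.nan == np.nan)   # False (NaN is never equal to itself)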
🔍 Topic 1: Detecting Missing Values
You use two key methods in sequence: `.isnull()` (or its alias `.isna()`) to identify the holes, and `.sum()` to count them.
🎯 Detection Workflow:
- **Step 1:** `df.isnull()` returns a Boolean DataFrame of the same shape (`True` where a value is missing).
- **Step 2:** `.sum()` treats `True` as 1 and `False` as 0, giving you the total count per column.
💻 Example: Counting NaN
import pandas as pd
import numpy as np

data = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df = pd.DataFrame(data)
print("\n--- Missing Count ---")
print(df.isnull().sum())
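To inspect the affected rows themselves (rather than per-column counts), a common follow-up, sketched here with the same `df`:
# Keep only the rows that contain at least one NaN
rows_with_nan = df[df.isnull().any(axis=1)]
print(rows_with_nan)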
🔥 Topic 2: Strategy 1: Dropping Data (.dropna())
If you have a huge dataset (thousands of rows) and only a small percentage of rows have missing data, the fastest and safest solution is to simply remove those rows.
⚠️ Warning: Data Loss
Dropping data means permanently losing the remaining valid information in those rows/columns. Only use this when the data loss is minimal or when imputation is impossible.
💻 Example: Dropping Rows & Columns
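A minimal sketch using the `df` from Topic 1 (which rows or columns you drop will depend on your own data):
# Drop every row that contains at least one NaN (axis=0 is the default)
print(df.dropna())

# Drop every column that contains at least one NaN
print(df.dropna(axis=1))     # only column 'C' survives in this df

# Drop a row only if ALL of its values are NaN
print(df.dropna(how='all'))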
💧 Topic 3: Strategy 2: Filling Data (.fillna())
**Imputation** is the process of filling in the missing data with calculated values. This is generally preferred over dropping, as it preserves the structure of your dataset.
💡 Common Imputation Techniques:
- **Mean:** Fill with the average of the column (best for normally distributed data).
- **Median:** Fill with the middle value (best if there are outliers).
- **Mode:** Fill with the most frequent value (best for categorical/non-numeric data).
- **Zero/Constant:** Fill with a fixed value (e.g., setting a missing price to $0).
💻 Example: Filling with Mean/Median
df['A'] = df['A'].fillna(df['A'].mean())    # mean of [1, 2] is 1.5
df['B'] = df['B'].fillna(df['B'].median())  # median of [5] is 5.0
print(df)
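Mode and constant filling follow the same pattern; a short sketch with a made-up categorical column (the `City` data here is purely illustrative):
df_cat = pd.DataFrame({'City': ['Ankara', np.nan, 'Ankara', 'Izmir']})

# .mode() returns a Series (there can be ties), so take the first value
df_cat['City'] = df_cat['City'].fillna(df_cat['City'].mode()[0])
print(df_cat)

# Constant fill: replace a missing price with 0
prices = pd.Series([9.99, np.nan, 4.50]).fillna(0)
print(prices)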
➡️ Topic 4: Time Series Filling (ffill / bfill)
When dealing with time-series data (like stock prices or sensor readings), filling with the mean can be misleading, because each value depends on the values around it in time. Instead, we fill gaps from neighboring observations.
Forward Fill (ffill)
Fills the NaN hole with the value of the **last valid observation** before it (e.g., if the sensor reading is missing now, use the value from 5 minutes ago).
Backward Fill (bfill)
Fills the NaN hole with the value of the **next valid observation** after it (e.g., fill a missing price with the price that came 5 minutes later).
💻 Example: ffill / bfill
s = pd.Series([10, np.nan, 12, np.nan])

print("ffill:")
print(s.ffill())   # Output: [10, 10, 12, 12]
print("\nbfill:")
print(s.bfill())   # Output: [10, 12, 12, NaN] (the last one stays NaN: there is no later value)
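The same idea on time-stamped data, sketched with made-up sensor readings (the `temp` values and timestamps are purely illustrative):
idx = pd.date_range('2024-01-01 09:00', periods=4, freq='5min')
sensor = pd.DataFrame({'temp': [21.0, np.nan, np.nan, 22.5]}, index=idx)

# Carry the last known reading forward until a new one arrives
print(sensor.ffill())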
🔄 Topic 5: The .replace() Method
Pandas' `isnull()` only detects genuine missing values such as `np.nan` (and `None`). If your missing data is represented by placeholder strings like **'N/A'** or **'?'**, or by a sentinel number like **-99**, you must convert those markers to `np.nan` first.
💻 Example: Replacing Strings with `NaN`
df_dirty = pd.DataFrame({'Income': [50000, '?', 'N/A']})
df_dirty = df_dirty.replace(['?', 'N/A'], np.nan)
print(df_dirty.isnull().sum())  # Income: 2
📚 Module Summary
- **Detect:** Use `.isnull().sum()` to find the count of `NaN`s.
- **Drop:** Use `.dropna()` to remove rows/columns (use only if data loss is small).
- **Impute:** Use `.fillna()` to replace `NaN` with Mean, Median, or Mode.
- **Time Series:** Use `ffill` (last valid value) or `bfill` (next valid value).
- **Replace:** Use `.replace()` to convert custom markers ('?', 'N/A') to `np.nan`.
🤔 Interview Q&A
**Q: When is mean imputation a bad choice?**
Mean imputation is a bad choice when your data contains **outliers** or when the distribution is heavily **skewed**. In those cases the mean itself is misleading, so it is safer to impute with the **median**.

**Q: Why does Pandas use `np.nan` instead of Python's `None`?**
NumPy arrays (the base for Pandas) are designed for fast numerical processing. If Pandas used `None`, the column would be forced into the slower Python `object` dtype. By using `np.nan`, Pandas can keep the fast `float` dtype even when values are missing.

**Q: What is the risk of calling `dropna()` with the default `axis=0`?**
`axis=0` (the default) drops rows. If missing values are scattered across many rows, `dropna(axis=0)` can delete most of your dataset, severely reducing its size and statistical power.

**Q: When would you prefer `ffill` over mean imputation?**
You use **ffill (forward fill)** primarily with **time-series data** (e.g., stock prices, temperature). In that context, the value observed immediately before the gap is a better estimate than the long-term average (mean).