DSPython

Handling Duplicates & Inconsistency

Learn to find and remove duplicate data and clean messy text values.

Data Science Beginner 45 min

🧽 Introduction: Fixing Duplicates & Messy Text

We've fixed the missing data holes (NaN). Now, we tackle the next two common issues: **Duplicates** and **Inconsistent String Formats**.

If the same customer is registered twice (a duplicate), your model will think they are two different people. If the same city is spelled "New York" and "new york", your model will also treat them as two different places.

This module teaches you to standardize text and ensure every customer or entry is represented only once.
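A quick sketch of the problem (the column names here are just illustrative):

```python
import pandas as pd

# Two spellings of the same city, typed by different people
df = pd.DataFrame({
    'customer': ['Ana', 'Ana', 'Ben'],
    'city': ['New York', 'new york', 'Boston'],
})

# Pandas sees 3 distinct city strings, not 2
print(df['city'].nunique())   # 3

# After lowercasing, the two spellings collapse into one
print(df['city'].str.lower().nunique())   # 2
```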

[Image of data cleaning process flowchart]

👯 Topic 1: Detecting Duplicates: .duplicated()

The core of duplicate handling is knowing which row is the original and which one is the copy (the **echo**).

💡 How it works:

The .duplicated() method compares each row to all preceding rows. If an exact match is found, it returns **True** for the matching row (the copy). The very first instance is marked **False** (the original).

💻 Example: Finding Copies

import pandas as pd
data = {'ID': [1, 2, 1, 3, 2], 'Score': [90, 80, 90, 75, 80]}
df = pd.DataFrame(data)

print("--- Duplicates Series ---")
print(df.duplicated())
# Output:
# 0: False (Original)
# 1: False (Original)
# 2: True (Copy of row 0)
# 3: False (Original)
# 4: True (Copy of row 1)

✂️ Topic 2: Removing Duplicates: .drop_duplicates()

Once detected, you use `drop_duplicates()` to generate a clean DataFrame without the unwanted copies.

🔑 The 'keep' Parameter:

This controls which version of the duplicate row should survive the cut:

  • keep='first': (Default) The first seen row stays.
  • keep='last': The latest seen row stays.
  • keep=False: **Drops all** instances of duplicated rows (with the df above, rows 0, 1, 2, and 4 are all removed, leaving only row 3).

💻 Example: Using 'keep'

# Using the original 'df'
df_last = df.drop_duplicates(keep='last')
print("--- Keeping Last ---")
print(df_last)
df_none = df.drop_duplicates(keep=False)
print("--- Dropping All Duplicates ---")
print(df_none)
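To double-check which rows survive each `keep` option, you can inspect the surviving index labels (reusing the ID/Score DataFrame from Topic 1):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 3, 2], 'Score': [90, 80, 90, 75, 80]})

# keep='first' (default): rows 0, 1, 3 survive
print(df.drop_duplicates().index.tolist())              # [0, 1, 3]

# keep='last': rows 2, 3, 4 survive
print(df.drop_duplicates(keep='last').index.tolist())   # [2, 3, 4]

# keep=False: only row 3 survives, because it has no duplicate at all
print(df.drop_duplicates(keep=False).index.tolist())    # [3]
```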

🔑 Topic 3: Duplicates in Subsets

Sometimes two rows share the same 'Name' but differ in another column, such as 'City'. If you only want to treat repeated names as duplicates, use the `subset` parameter.

💻 Example: Checking Duplicates by Name Only

data = {'Name': ['A', 'B', 'A'], 'City': ['NY', 'LA', 'Chicago']}
df = pd.DataFrame(data)

# 1. Detect duplicates based only on the 'Name' column
print(df.duplicated(subset=['Name']))
# Output: 0: False, 1: False, 2: True (Row 2 is a duplicate of Row 0 by Name)

# 2. Drop duplicates based on Name, keeping the first one
df_unique_names = df.drop_duplicates(subset=['Name'], keep='first')
print(df_unique_names)
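A common real-world recipe combines `subset` with `keep='last'` to retain each customer's most recent record. A minimal sketch (the 'signup_date' column here is a hypothetical addition for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'A'],
    'City': ['NY', 'LA', 'Chicago'],
    'signup_date': ['2023-01-01', '2023-02-10', '2023-03-05'],
})

# Sort chronologically, then keep the most recent row per Name
latest = (df.sort_values('signup_date')
            .drop_duplicates(subset=['Name'], keep='last'))
print(latest)
# 'A' keeps the Chicago / 2023-03-05 row; 'B' is unchanged
```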

✍️ Topic 4: Handling Inconsistent Data

Inconsistent data is usually caused by human entry errors: typos, stray whitespace, and mixed capitalization. We use the **`.str` accessor** to apply powerful string methods to an entire Pandas Series (column).

📝 String Standardization Checklist:

  • **Step 1 (Whitespace):** `.str.strip()` — Remove surrounding spaces.
  • **Step 2 (Case):** `.str.lower()` — Convert all to lowercase.
  • **Step 3 (Symbols):** `.str.replace()` — Remove or fix unwanted symbols/abbreviations.

💻 Example: Chaining String Operations

data = {'city': [' New York ', 'chicago', 'NY.']}
df = pd.DataFrame(data)

# Fix: strip whitespace, then lowercase, then remove the '.' symbol
df['city_cleaned'] = (df['city'].str.strip()                         # Step 1: fixes ' New York '
                                .str.lower()                         # Step 2: normalizes case
                                .str.replace('.', '', regex=False))  # Step 3: fixes 'NY.' (regex=False treats '.' literally)
print(df)
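The payoff of standardization is that `drop_duplicates()` can then do its job. A minimal sketch combining both steps:

```python
import pandas as pd

df = pd.DataFrame({'city': [' New York ', 'new york', 'NEW YORK', 'chicago']})

# Standardize first...
df['city'] = df['city'].str.strip().str.lower()

# ...then deduplicate: the three New York spellings collapse into one row
df_clean = df.drop_duplicates()
print(df_clean)   # 'new york' and 'chicago' remain
```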

📚 Module Summary

  • **Detect Duplicates:** Use .duplicated() to find copies.
  • **Remove Duplicates:** Use .drop_duplicates().
  • **Subset:** Use subset=['col'] to check duplicates only in specific columns.
  • **String Cleaning:** Use the .str accessor for column-wide fixes.
  • **Standardize:** Always convert text to .str.lower() and .str.strip().

🤔 Interview Q&A


**Q: What does keep=False do?**

df.duplicated(keep=False) returns a Boolean Series in which **both** the original and its subsequent copies are marked True. When used with drop_duplicates(), it removes **all** copies of each duplicated row, including the very first instance.

**Q: What does the subset parameter do?**

The subset parameter (e.g., subset=['Name', 'Email']) tells Pandas to check for duplicate values only within those specific columns. If a row has the same Name and Email but a different Address, it will still be considered a duplicate under the subset rule.

**Q: Why standardize strings before dropping duplicates?**

If you have 'NEW YORK' and 'New York', Pandas treats them as two unique strings. By applying .str.lower() first, both become 'new york', allowing drop_duplicates() to correctly identify the second one as a copy and remove it.

**Q: Does drop_duplicates() modify the DataFrame in place?**

No, not by default: drop_duplicates() returns a new DataFrame. To modify the DataFrame in place without creating a new variable, pass inplace=True (e.g., df.drop_duplicates(inplace=True)).