
Introduction to Data Cleaning

Learn why cleaning data is the most important step in data science.

Data Science Beginner 15 min

🧹 Introduction: The "Dirty Laundry" Analogy

Imagine you are about to cook a delicious meal. Would you throw unwashed vegetables, potato skins, and even a few pebbles directly into the pot? No! You wash, peel, and chop them first.

Data Cleaning is exactly that. Real-world data is "dirty": it's messy, incomplete, and full of errors. If you feed this dirty data into an AI model, you get "Garbage In, Garbage Out".

Before we can analyze data or build cool AI models, we typically spend 60-80% of our time cleaning the data. It is the most crucial skill for a Data Scientist.

🧼 Topic 1: What is Data Cleaning?

Data cleaning (or data cleansing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset.

[Image of data cleaning process flowchart]

🎯 The Goal:

To transform raw, messy data into a clean, consistent format that computers can understand and analyze.

💻 Example: Dirty vs Clean

# Dirty Data (What you get)
raw_data = [
    {"Name": "Vinay", "Age": "25"},  # Age is a string
    {"Name": "Priya", "Age": None},  # Age is missing
    {"Name": "Vinay", "Age": 25},    # Duplicate!
]

# Clean Data (What you want)
clean_data = [
    {"Name": "Vinay", "Age": 25},
    {"Name": "Priya", "Age": 24},  # Filled missing value
]

🦠 Topic 2: Types of "Dirt" in Data

Data doesn't come clean. It comes from humans (who make mistakes) or sensors (which can fail). Here are the most common problems:

1. Missing Data (NaN)

Information is simply not there. A user skipped a form field, or a sensor lost connection. In Python, this is often represented as NaN (Not a Number) or None.
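Here is a minimal sketch of how pandas represents and detects missing values (the names and ages are illustrative):

```python
import pandas as pd

# A tiny DataFrame with a missing age; None becomes NaN in pandas
df = pd.DataFrame({"Name": ["Vinay", "Priya"], "Age": [25, None]})

print(df["Age"].isna())        # True where the value is missing
print(df["Age"].isna().sum())  # count of missing values -> 1
```

`.isna()` returns a True/False mask, which is the starting point for both dropping and filling missing values later in the course.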

2. Duplicates

The same record appears multiple times. This can happen if a user clicks "Submit" twice or if data is merged from different sources incorrectly.
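A quick sketch of spotting and removing duplicates with pandas (the repeated row is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Vinay", "Priya", "Vinay"],
                   "Age": [25, 24, 25]})

print(df.duplicated())          # True for the repeated row
deduped = df.drop_duplicates()  # keeps the first occurrence by default
print(len(deduped))             # 2 rows remain
```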

3. Inconsistent Data

Different formats for the same thing. Example: "New York", "NY", "new york", "NYC". The computer treats these as 4 different cities!
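One common fix is to normalize the text and map known aliases to a single canonical name. A sketch using the city example above (the alias dictionary is something you would build for your own data):

```python
import pandas as pd

cities = pd.Series(["New York", "NY", "new york", "NYC"])
print(cities.nunique())  # 4 "different" cities before cleaning

# Lowercase everything, then map known aliases to one canonical name
aliases = {"ny": "new york", "nyc": "new york"}
cleaned = cities.str.lower().replace(aliases)
print(cleaned.nunique())  # 1 city after cleaning
```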

4. Outliers

Values that are impossibly high or low. Example: A user's age listed as 200 years, or a house price listed as $5.
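The simplest way to catch such values is a rule-based range check; a sketch using the impossible-age example (the 0-120 bounds are an assumption about plausible human ages):

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 28, 200])  # 200 is clearly impossible

# Flag values outside a plausible range
outliers = ages[(ages < 0) | (ages > 120)]
print(outliers.tolist())  # [200]
```

Later in the course we will use statistical methods instead of hand-picked bounds.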

📋 Topic 3: Our Cleaning Workflow

In this course, we will follow a standard industry workflow to clean our data using the Pandas library.

  • 1. Inspection: Load the data and look at it (`.head()`, `.info()`).
  • 2. Handling Missing Values: Decide whether to delete rows (`.dropna()`) or fill them (`.fillna()`).
  • 3. Removing Duplicates: Identify and remove repeated copies (`.drop_duplicates()`).
  • 4. Data Formatting: Fix data types (e.g., convert string "25" to integer 25).
  • 5. Outlier Treatment: Find and fix extreme values using statistics.
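The steps above can be sketched end to end in a few lines of pandas (the toy dataset and the median-fill strategy are illustrative choices, not the only correct ones):

```python
import pandas as pd

# Hypothetical raw data with a string age, a missing age, and a duplicate
df = pd.DataFrame({
    "Name": ["Vinay", "Priya", "Vinay"],
    "Age": ["25", None, "25"],
})

df.info()                                          # 1. Inspection
df["Age"] = pd.to_numeric(df["Age"])               # 4. Formatting: "25" -> 25.0
df["Age"] = df["Age"].fillna(df["Age"].median())   # 2. Fill missing with median
df = df.drop_duplicates()                          # 3. Remove the repeated row
print(df)
```

Note that the step order can vary in practice; here we convert types before filling so the median can be computed.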

📚 Module Summary

  • Data Cleaning: Fixing errors in raw data.
  • Garbage In, Garbage Out: Bad data leads to bad models.
  • NaN: Represents missing data in Python (Not a Number).
  • Pandas: The most popular Python library for data cleaning.

🤔 Interview Q&A

Q1: Why is data cleaning important?

Data cleaning is crucial because real-world data is messy. If we don't clean it, our analysis will be inaccurate, and machine learning models will fail or give wrong predictions ("Garbage In, Garbage Out").

Q2: What is NaN?

NaN stands for "Not a Number". It is a special floating-point value used in Pandas and NumPy to represent missing or undefined data.

Q3: How do you handle missing data?

There are two main ways:
1. Dropping: Remove the rows/columns with missing data (good if you have lots of data).
2. Imputation: Fill the missing values with the mean, median, or mode (better to preserve data).

Q4: What is an outlier?

An outlier is a data point that differs significantly from other observations. For example, if everyone's salary is around $50k, but one person earns $50M, that person is an outlier. Outliers can skew statistical analysis.