DSPython

Getting Started (Pandas/NumPy)

Load your data and take the first look using Pandas.

Data Cleaning · Beginner · 30 min

🔬 Introduction: NumPy & Pandas - The Lab Kit

In Data Science, we deal with millions of data points. Standard Python lists are too slow. We need specialized tools.

1. NumPy (Numerical Python): The ultra-fast **Engine**. It provides **Vectorization**: the ability to perform calculations on entire arrays of data simultaneously, making it hundreds of times faster than Python loops.
2. Pandas: The **Smart Spreadsheet**. Built on top of NumPy, it provides labeled tables (`DataFrames`) designed specifically for real-world tasks like loading CSVs, cleaning messy data, and analyzing statistics.

import numpy as np
import pandas as pd  # These aliases are the standard convention!
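To see what vectorization buys you, here is a minimal sketch (the price values are illustrative): the same calculation written once as a single array operation and once as an explicit Python loop.

```python
import numpy as np

# Vectorized: one operation applied to the whole array at once
prices = np.array([10.0, 20.0, 30.0])
with_tax = prices * 1.08  # no explicit Python loop needed

# Equivalent loop version - same result, but far slower on large arrays
with_tax_loop = [p * 1.08 for p in prices]
```

On arrays with millions of elements, the vectorized form runs in optimized C under the hood, which is where the large speedups come from.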

๐Ÿผ Topic 2: DataFrame vs Series

Pandas uses two primary data structures to organize data:

DataFrame (2D):

A DataFrame is the entire table. It has labeled **Columns** (Headers like 'Name', 'Age') and **Index** (Row numbers/labels). It can hold different data types in different columns.

Series (1D):

A Series is just a single column or a single row of a DataFrame. It is similar to a NumPy array but has labels (index).
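The relationship between the two structures can be seen directly; a small sketch with made-up names and ages:

```python
import pandas as pd

# A DataFrame: a whole labeled table (columns may hold different dtypes)
df = pd.DataFrame({'Name': ['Ada', 'Bo'], 'Age': [36, 28]})

# Selecting one column gives a Series (1D, and it keeps the row index)
ages = df['Age']
print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
```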

๐Ÿ“ Topic 3: Data Indexing - loc vs iloc

Unlike Python lists that use only position (0, 1, 2), DataFrames can be accessed in two ways. This distinction is vital for writing bug-free code.

1. .loc[] (Label-based)

Use **actual row/column names** (labels). To select the 'Age' column this way, write `df.loc[:, 'Age']`.

2. .iloc[] (Integer-based)

Use **integer positions** (0, 1, 2). To select the first column, write `df.iloc[:, 0]`. It works just like Python list indexing.

💻 Example: Accessing a Row

# A small example table; the default index is 0, 1, 2, ...
df = pd.DataFrame({'Name': ['Ada', 'Bo'], 'Age': [36, 28]})

first_row_by_label = df.loc[0]         # row whose index label is 0
name_by_pos        = df.iloc[0, 0]     # row 0, column 0
age_column         = df.loc[:, 'Age']  # all rows, 'Age' column

🔎 Topic 4: The Golden Commands (Inspection)

These three functions are essential for getting a quick feel for your data's quality and structure.

1. df.info()

**The Missing Data Detector.** Shows the number of non-null values in each column, plus its data type. If a column's non-null count is lower than the total number of rows, that column has **Missing Data (NaN)**.

2. df.describe()

**The Statistics Reporter.** Gives instant count, mean, standard deviation, min, max, and quartiles (the median is the 50% row) for **all numerical columns**. Great for quickly spotting outliers (e.g., if max age is 300).

3. df.head() / df.tail()

**The Quick Look.** Shows the first/last 5 rows to ensure data loaded correctly.
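The three commands in action on a tiny, made-up table (the NaN is planted deliberately so `info()` has something to detect):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Ada', 'Bo', 'Cy'],
                   'Age': [36, np.nan, 28]})

df.head()              # quick look at the first rows
df.info()              # 'Age' shows 2 non-null out of 3 rows -> one NaN
stats = df.describe()  # count, mean, min, max, ... for numeric columns
```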

🧹 Topic 5: Initial Cleaning - Missing Values

Since `df.info()` showed missing values, our first cleaning task is to handle them.

[Image of data cleaning process flowchart]

๐Ÿ“ Steps to Handle Missing Data:

# Step 1: Check how many are missing (Sum of True/False)
df.isna().sum()

# Step 2A: Remove all rows with ANY missing value (Use with caution)
df_cleaned = df.dropna()

# Step 2B: Fill missing values with a calculated number (Imputation)
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)  # assign back; inplace=True on a column is deprecated
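Putting the steps together on a tiny, made-up table (the names and ages are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Ada', 'Bo', 'Cy', 'Di'],
                   'Age': [36, np.nan, 28, np.nan]})

missing_before = df['Age'].isna().sum()   # Step 1: count the gaps
median_age = df['Age'].median()           # median of [36, 28] -> 32.0
df['Age'] = df['Age'].fillna(median_age)  # Step 2B: imputation
missing_after = df['Age'].isna().sum()    # all gaps filled
```

Imputing with the median (rather than the mean) is a common choice because the median is robust to outliers.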

📚 Module Summary

  • NumPy: Fast arrays, the engine of data science.
  • Pandas: DataFrames (tables), the main interface.
  • Indexing: Use loc (labels/names) or iloc (position numbers).
  • Inspection: head(), info(), describe() are the first three commands.
  • Cleaning: Use isna().sum() to check missing values, and fillna() to fix them.

🤔 Interview Q&A


**Q: Why are NumPy arrays faster than Python lists?**
NumPy arrays are faster because they support **Vectorized Operations** (running calculations on entire blocks of memory simultaneously) and store data of a single, uniform type, unlike Python lists which store pointers to objects of varying types.

**Q: What is the difference between .loc and .iloc?**
.loc: Accesses data based on **Labels** (names) for both rows and columns. (e.g., df.loc[2, 'Name'])
.iloc: Accesses data based on **Integer Positions** (numbers) for both rows and columns. (e.g., df.iloc[2, 0])

**Q: Why is df.info() one of the first commands to run on a new dataset?**
df.info() provides crucial metadata: the **Data Type (dtype)** of each column and the **count of non-null values**. This immediately indicates which columns have missing data (NaN), which df.head() cannot do accurately.

**Q: What does df.dropna() do, and what is the risk of using it?**
df.dropna() removes any row that contains **any** missing value. The danger is that if you have many columns and only scattered missing values, you might end up deleting most of your dataset, losing valuable clean data.