DSPython

Getting Started (Pandas/NumPy)

Load your data and take the first look using Pandas.

Data Cleaning · Beginner · 30 min

🔬 Introduction: NumPy & Pandas - The Lab Kit

In Data Science, we deal with millions of data points. Standard Python lists are too slow. We need specialized tools.

1. NumPy (Numerical Python): The ultra-fast **Engine**. It provides **Vectorization**: the ability to perform calculations on entire arrays of data simultaneously, making it hundreds of times faster than Python loops.
2. Pandas: The **Smart Spreadsheet**. Built on top of NumPy, it provides labeled tables (`DataFrames`) designed specifically for real-world tasks like loading CSVs, cleaning messy data, and analyzing statistics.

import numpy as np
import pandas as pd  # These aliases are the standard convention!
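To see what vectorization buys you, here is a minimal sketch (the price values are illustrative): the same calculation written once as a single array operation and once as an explicit Python loop.

```python
import numpy as np

# Vectorized: one operation applied to the whole array at once
prices = np.array([10.0, 20.0, 30.0])
with_tax = prices * 1.08  # no explicit Python loop needed

# Equivalent loop version - same result, but far slower on large arrays
with_tax_loop = [p * 1.08 for p in prices]
```

On arrays with millions of elements, the vectorized form runs in optimized C under the hood, which is where the large speedups come from.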

๐Ÿผ Topic 2: DataFrame vs Series

Pandas uses two primary data structures to organize data:

DataFrame (2D):

A DataFrame is the entire table. It has labeled **Columns** (Headers like 'Name', 'Age') and **Index** (Row numbers/labels). It can hold different data types in different columns.

Series (1D):

A Series is just a single column or a single row of a DataFrame. It is similar to a NumPy array but has labels (index).
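The relationship between the two structures can be seen directly; a small sketch with made-up names and ages:

```python
import pandas as pd

# A DataFrame: a whole labeled table (columns may hold different dtypes)
df = pd.DataFrame({'Name': ['Ada', 'Bo'], 'Age': [36, 28]})

# Selecting one column gives a Series (1D, and it keeps the row index)
ages = df['Age']
print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
```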

๐Ÿ“ Topic 3: Data Indexing - loc vs iloc

Unlike Python lists that use only position (0, 1, 2), DataFrames can be accessed in two ways. This distinction is vital for writing bug-free code.

1. .loc[] (Label-based)

Use **actual row/column names** (labels). To select the 'Age' column this way, write `df.loc[:, 'Age']`.

2. .iloc[] (Integer-based)

Use **integer positions** (0, 1, 2). To select the first column, write `df.iloc[:, 0]`. It works just like Python list indexing.

💻 Example: Accessing a Row

# A small example table; the default index is 0, 1, 2, ...
df = pd.DataFrame({'Name': ['Ada', 'Bo'], 'Age': [36, 28]})

first_row_by_label = df.loc[0]         # row whose index label is 0
name_by_pos        = df.iloc[0, 0]     # row 0, column 0
age_column         = df.loc[:, 'Age']  # all rows, 'Age' column

🔎 Topic 4: The Golden Commands (Inspection)

These three functions are essential for getting a quick feel for your data's quality and structure.

1. df.info()

**The Missing Data Detector.** Shows the number of non-null values in each column, plus its data type. If a column's non-null count is lower than the total number of rows, that column has **Missing Data (NaN)**.

2. df.describe()

**The Statistics Reporter.** Gives instant count, mean, standard deviation, min, max, and quartiles (the median is the 50% row) for **all numerical columns**. Great for quickly spotting outliers (e.g., if max age is 300).

3. df.head() / df.tail()

**The Quick Look.** Shows the first/last 5 rows to ensure data loaded correctly.
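The three commands in action on a tiny, made-up table (the NaN is planted deliberately so `info()` has something to detect):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Ada', 'Bo', 'Cy'],
                   'Age': [36, np.nan, 28]})

df.head()              # quick look at the first rows
df.info()              # 'Age' shows 2 non-null out of 3 rows -> one NaN
stats = df.describe()  # count, mean, min, max, ... for numeric columns
```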

🧹 Topic 5: Initial Cleaning - Missing Values

Since `df.info()` showed missing values, our first cleaning task is to handle them.

[Image of data cleaning process flowchart]

๐Ÿ“ Steps to Handle Missing Data:

# Step 1: Check how many are missing (Sum of True/False)
df.isna().sum()

# Step 2A: Remove all rows with ANY missing value (Use with caution)
df_cleaned = df.dropna()

# Step 2B: Fill missing values with a calculated number (Imputation)
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)  # assign back; inplace=True on a column is deprecated
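Putting the steps together on a tiny, made-up table (the names and ages are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Ada', 'Bo', 'Cy', 'Di'],
                   'Age': [36, np.nan, 28, np.nan]})

missing_before = df['Age'].isna().sum()   # Step 1: count the gaps
median_age = df['Age'].median()           # median of [36, 28] -> 32.0
df['Age'] = df['Age'].fillna(median_age)  # Step 2B: imputation
missing_after = df['Age'].isna().sum()    # all gaps filled
```

Imputing with the median (rather than the mean) is a common choice because the median is robust to outliers.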

📚 Module Summary

  • NumPy: Fast arrays, the engine of data science.
  • Pandas: DataFrames (tables), the main interface.
  • Indexing: Use loc (labels/names) or iloc (position numbers).
  • Inspection: head(), info(), describe() are the first three commands.
  • Cleaning: Use isna().sum() to check missing values, and fillna() to fix them.

🤔 Interview Q&A


**Q: Why are NumPy arrays faster than Python lists?**
NumPy arrays are faster because they support **Vectorized Operations** (running calculations on entire blocks of memory simultaneously) and store data of a single, uniform type, unlike Python lists which store pointers to objects of varying types.

**Q: What is the difference between .loc and .iloc?**
.loc: Accesses data based on **Labels** (names) for both rows and columns. (e.g., df.loc[2, 'Name'])
.iloc: Accesses data based on **Integer Positions** (numbers) for both rows and columns. (e.g., df.iloc[2, 0])

**Q: Why is df.info() one of the first commands to run on a new dataset?**
df.info() provides crucial metadata: the **Data Type (dtype)** of each column and the **count of non-null values**. This immediately indicates which columns have missing data (NaN), which df.head() cannot do accurately.

**Q: What does df.dropna() do, and what is the risk of using it?**
df.dropna() removes any row that contains **any** missing value. The danger is that if you have many columns and only scattered missing values, you might end up deleting most of your dataset, losing valuable clean data.