Getting Started (Pandas/NumPy)
Load your data and take the first look using Pandas.
🔬 Introduction: NumPy & Pandas - The Lab Kit
In Data Science, we deal with millions of data points. Standard Python lists are too slow. We need specialized tools.
1. NumPy (Numerical Python): The ultra-fast **Engine**. It provides **Vectorization**: the ability to perform calculations on entire arrays of data simultaneously, making it hundreds of times faster than Python loops.
2. Pandas: The **Smart Spreadsheet**. Built on top of NumPy, it provides labeled tables (`DataFrames`) designed specifically for real-world tasks like loading CSVs, cleaning messy data, and analyzing statistics.
import numpy as np
import pandas as pd  # These aliases are the universal convention!
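To see what vectorization means in practice, here is a minimal sketch (the price values are made up for illustration):

import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])

# Vectorized: one expression operates on every element at once,
# executed in optimized C code instead of a Python-level loop.
discounted = prices * 0.9

# The plain-Python equivalent produces the same numbers element by element,
# which becomes dramatically slower on large arrays.
discounted_loop = [p * 0.9 for p in prices]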
🐼 Topic 2: DataFrame vs Series
Pandas uses two primary data structures to organize data:
DataFrame (2D):
A DataFrame is the entire table. It has labeled **Columns** (Headers like 'Name', 'Age') and **Index** (Row numbers/labels). It can hold different data types in different columns.
Series (1D):
A Series is just a single column or a single row of a DataFrame. It is similar to a NumPy array but has labels (index).
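A quick sketch to make the distinction concrete (the names and ages are invented sample data):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Ada', 'Grace', 'Linus'],  # hypothetical sample data
    'Age': [36, 45, 52],
})

ages = df['Age']       # one column -> a Series (1D, keeps the index labels)
first_row = df.loc[0]  # one row -> also a Series

print(type(df))    # <class 'pandas.core.frame.DataFrame'>
print(type(ages))  # <class 'pandas.core.series.Series'>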
📍 Topic 3: Data Indexing - loc vs iloc
Unlike Python lists that use only position (0, 1, 2), DataFrames can be accessed in two ways. This distinction is vital for writing bug-free code.
1. .loc[] (Label-based)
Use **actual row/column names** (labels). To get the 'Age' column, write `df.loc[:, 'Age']`.
2. .iloc[] (Integer-based)
Use **integer positions** (0, 1, 2). To get the first column, write `df.iloc[:, 0]`. It works just like Python list indexing.
💻 Example: Accessing a Row
# Assume a default integer index (0, 1, 2, ...) and that 'Name' is the first column
first_row_by_label = df.loc[0]         # row whose index label is 0
name_by_pos        = df.iloc[0, 0]     # row 0, column 0
age_column         = df.loc[:, 'Age']  # all rows, 'Age' column
🔍 Topic 4: The Golden Commands (Inspection)
These three functions are essential for getting a quick feel for your data's quality and structure.
`df.info()` - **The Missing Data Detector.** Shows each column's dtype and its total number of non-null values. If this number is lower than the total rows, you have **Missing Data (NaN)**.
`df.describe()` - **The Statistics Reporter.** Gives instant mean, median (the 50% quartile), min, max, and count for **all numerical columns**. Great for quickly spotting outliers (e.g., if max age is 300).
`df.head()` - **The Quick Look.** Shows the first 5 rows (`df.tail()` shows the last 5) to ensure data loaded correctly.
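Assuming the data lives in a CSV (the file name below is a placeholder), a typical first-look session runs all three:

import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file name

print(df.head())      # Quick Look: first 5 rows
df.info()             # Missing Data Detector: dtypes + non-null counts
print(df.describe())  # Statistics Reporter: count, mean, min, quartiles, max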
🧹 Topic 5: Initial Cleaning - Missing Values
Since `df.info()` showed missing values, our first cleaning task is to handle them.
📋 Steps to Handle Missing Data:
# Step 1: Count missing values in each column
df.isna().sum()
# Step 2A: Remove all rows with ANY missing value (Use with caution)
df_cleaned = df.dropna()
# Step 2B: Fill missing values with a calculated number (Imputation)
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)  # assignment avoids the deprecated chained inplace=True
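After imputation it is worth re-checking the count; a quick sanity check (the `assert` is just one way to verify):

# Confirm the imputed column no longer has gaps
assert df['Age'].isna().sum() == 0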
📝 Module Summary
- NumPy: Fast arrays, the engine of data science.
- Pandas: DataFrames (tables), the main interface.
- Indexing: Use `loc` (labels/names) or `iloc` (position numbers).
- Inspection: `head()`, `info()`, `describe()` are the first three commands.
- Cleaning: Use `isna().sum()` to check missing values, and `fillna()` to fix them.
🤔 Interview Q&A
Q: Why are NumPy arrays faster than Python lists?
A: NumPy arrays support **Vectorized Operations** (running calculations on entire blocks of memory simultaneously) and store data of a single, uniform type, unlike Python lists, which store pointers to objects of varying types.
Q: What is the difference between `.loc` and `.iloc`?
A: `.loc` accesses data based on **Labels** (names) for both rows and columns (e.g., `df.loc[2, 'Name']`); `.iloc` accesses data based on **Integer Positions** (e.g., `df.iloc[2, 0]`).
Q: Why run `df.info()` when `df.head()` already shows the data?
A: `df.info()` provides crucial metadata: the **Data Type (dtype)** of each column and the **count of non-null values**. This immediately reveals which columns have missing data (NaN), which `df.head()` cannot do reliably.
Q: What is the danger of `df.dropna()`?
A: `df.dropna()` removes any row that contains **any** missing value. The danger is that with many columns and only scattered missing values, you might end up deleting most of your dataset, losing valuable clean data.
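A small sketch of that failure mode with made-up data, plus the safer `subset` option:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3, 4],
    'B': [np.nan, 2, 3, 4],
    'C': [1, 2, np.nan, 4],
})

print(len(df.dropna()))              # 1 -- three of the four rows vanish
print(len(df.dropna(subset=['A'])))  # 3 -- drop only where column 'A' is missing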