Data Transformation
Learn to apply functions, group data, and create new insights with Pandas.
π Introduction: The Data Transformation Factory
We have already cleaned the data (fixed NaNs, duplicates). Now, we move to **Data Transformation**. This is like a factory where raw data is converted into *specialized products* (new features or powerful summarized reports).
The goal of transformation is **Feature Engineering**, the art of creating predictive features for machine learning models.
β Topic 1: Vectorization (Fast Column Operations)
The fastest way to create new columns is by performing arithmetic operations directly on the columns. This uses **Vectorization**βthe operation runs on the entire column in *parallel* at **C-speed**, bypassing slow Python iteration.
π» Example: Calculating Profit
data = {'Revenue': [100, 200], 'Cost': [30, 50]}df = pd.DataFrame(data)
# New column = Revenue - Cost (Vectorized Subtraction)
df['Profit'] = df['Revenue'] - df['Cost']
print(df)
π§ Topic 2: Row Logic: .apply(axis=1)
When calculation requires **conditional logic** based on *multiple columns in the same row*, Vectorization fails. You must use **.apply()** with the parameter axis=1 (meaning "apply function across rows").
We usually pass a **lambda** function to .apply(). The input to the lambda (`row` in the example) is the *entire row* of the DataFrame.
π» Example: Determining Eligibility
data = {'Age': [20, 16], 'Score': [80, 95]}df = pd.DataFrame(data)
# Condition: Must be > 18 AND Score > 70
df['Eligible'] = df.apply(lambda row: 'Yes' if (row['Age'] > 18 and row['Score'] > 70) else 'No', axis=1)
print(df)
πͺ Topic 3: Binning & Categorization: pd.cut()
**Binning** is the process of converting a continuous numerical variable (like age or income) into categorical groups (or **bins**). This helps simplify data for certain models. We use the **pd.cut()** function for this.
π» Example: Grouping Ages
data = {'Age': [15, 28, 45, 65]}df = pd.DataFrame(data)
# Define the boundaries (bins) and the category names (labels)
bins = [0, 18, 35, 60, 100]
labels = ['Teen', 'YoungAdult', 'Adult', 'Senior']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
print(df)
π Topic 4: Multi-Aggregation: .agg()
The **.groupby()** method breaks the data into groups. Instead of just calculating the `sum()` or `mean()`, we use the **.agg()** method to perform *multiple aggregation functions* simultaneously on different columns.
π» Example: Mean and Max Sales by City
data = {'City': ['H', 'B', 'H'], 'Sales': [100, 50, 200], 'Units': [10, 5, 20]}df = pd.DataFrame(data)
# Group by City, then apply SUM and MEAN
city_summary = df.groupby('City').agg({
Β Β 'Sales': ['sum', 'max'],
Β Β 'Units': 'mean'
})
print(city_summary)
π Topic 5: Data Reshaping: .pivot_table()
A **Pivot Table** is essential for creating structured summary reports, similar to pivot tables in Excel. It aggregates data while restructuring the DataFrame using three key parameters: **index**, **columns**, and **values**.
π» Example: Pivot Table for Mean Sales
data = {'City': ['A', 'B', 'A'], 'Product': ['X', 'X', 'Y'], 'Price': [10, 20, 30]}df = pd.DataFrame(data)
# Index=City (Rows), Columns=Product (Headers), Values=Mean Price
report = pd.pivot_table(df, index='City', columns='Product', values='Price', aggfunc='mean')
print(report)
π·οΈ Topic 6: Simple Substitution: .map()
The **.map()** method is ideal for quickly replacing values in a Series (column) using a simple dictionary lookup.
π» Example: Rating Conversion
df['Rating'] = df['Score'].map({10: 'Low', 20: 'High'})π Module Summary
- Vectorization: Arithmetic on entire columns (fastest).
- .apply(axis=1): Used for conditional logic across rows.
- pd.cut(): Converts continuous numbers (like Age) into categories (Bins).
- .groupby().agg(): Performs multiple summary metrics (mean, max, sum) simultaneously.
- .pivot_table(): Creates structured, cross-tabulated reports.
π€ Interview Q&A
Tap on the questions below to reveal the answers.
You must choose .apply(axis=1) only when the transformation involves complex **conditional logic (if/else)** that depends on values across *multiple columns* in the *same row*. For simple math, Vectorization is always superior.
Binning simplifies the relationship between a continuous variable (like Salary) and the target variable. It reduces the impact of minor fluctuations and helps models focus on the categorical difference (e.g., separating "Low Income" from "High Income").
**.groupby()** outputs a summarized Series (or flattened DataFrame) with the grouping column as the index.
**pd.pivot_table()** outputs a 2D table where one column becomes the new index (rows) and another column becomes the new headers (columns), which is essential for structured reports.
The primary purpose of .agg() (Aggregation) is to perform **multiple statistical summaries** on a grouped dataset in a single operation (e.g., calculate the mean, minimum, and count of 'Sales' simultaneously after grouping by 'City').