Data Transformation

Learn to apply functions, group data, and create new insights with Pandas.

Data Science Intermediate 60 min

🏭 Introduction: The Data Transformation Factory

We have already cleaned the data (fixed NaNs, duplicates). Now, we move to **Data Transformation**. This is like a factory where raw data is converted into *specialized products* (new features or powerful summarized reports).

The goal of transformation is **Feature Engineering**, the art of creating predictive features for machine learning models.

➕ Topic 1: Vectorization (Fast Column Operations)

The fastest way to create new columns is by performing arithmetic operations directly on the columns. This uses **Vectorization**—the operation runs on the entire column in *parallel* at **C-speed**, bypassing slow Python iteration.

            💻 Example: Calculating Profit
            data = {'Revenue': [100, 200], 'Cost': [30, 50]}

            df = pd.DataFrame(data)

            # New column = Revenue - Cost (Vectorized Subtraction)

            df['Profit'] = df['Revenue'] - df['Cost']

            print(df)

🧠 Topic 2: Row Logic: `.apply(axis=1)`

When calculation requires **conditional logic** based on *multiple columns in the same row*, Vectorization fails. You must use **.apply()** with the parameter axis=1 (meaning "apply function across rows").

We usually pass a **lambda** function to .apply(). The input to the lambda (`row` in the example) is the *entire row* of the DataFrame.

            💻 Example: Determining Eligibility
            data = {'Age': [20, 16], 'Score': [80, 95]}

            df = pd.DataFrame(data)

            # Condition: Must be > 18 AND Score > 70

            df['Eligible'] = df.apply(lambda row: 'Yes' if (row['Age'] > 18 and row['Score'] > 70) else 'No', axis=1)

            print(df)

🔪 Topic 3: Binning & Categorization: `pd.cut()`

**Binning** is the process of converting a continuous numerical variable (like age or income) into categorical groups (or **bins**). This helps simplify data for certain models. We use the **pd.cut()** function for this.

            💻 Example: Grouping Ages
            data = {'Age': [15, 28, 45, 65]}

            df = pd.DataFrame(data)

            # Define the boundaries (bins) and the category names (labels)

            bins = [0, 18, 35, 60, 100]

            labels = ['Teen', 'YoungAdult', 'Adult', 'Senior']

            df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

            print(df)

📊 Topic 4: Multi-Aggregation: `.agg()`

The **.groupby()** method breaks the data into groups. Instead of just calculating the `sum()` or `mean()`, we use the **.agg()** method to perform *multiple aggregation functions* simultaneously on different columns.

            💻 Example: Mean and Max Sales by City
            data = {'City': ['H', 'B', 'H'], 'Sales': [100, 50, 200], 'Units': [10, 5, 20]}

            df = pd.DataFrame(data)

            # Group by City, then apply SUM and MEAN

            city_summary = df.groupby('City').agg({

                'Sales': ['sum', 'max'],

                'Units': 'mean'

            })

            print(city_summary)

🔄 Topic 5: Data Reshaping: `.pivot_table()`

A **Pivot Table** is essential for creating structured summary reports, similar to pivot tables in Excel. It aggregates data while restructuring the DataFrame using three key parameters: **index**, **columns**, and **values**.

            💻 Example: Pivot Table for Mean Sales
            data = {'City': ['A', 'B', 'A'], 'Product': ['X', 'X', 'Y'], 'Price': [10, 20, 30]}

            df = pd.DataFrame(data)

            # Index=City (Rows), Columns=Product (Headers), Values=Mean Price

            report = pd.pivot_table(df, index='City', columns='Product', values='Price', aggfunc='mean')

            print(report)

🏷️ Topic 6: Simple Substitution: `.map()`

The **.map()** method is ideal for quickly replacing values in a Series (column) using a simple dictionary lookup.

            💻 Example: Rating Conversion
            df['Rating'] = df['Score'].map({10: 'Low', 20: 'High'})
        

📚 Module Summary

Vectorization: Arithmetic on entire columns (fastest).
.apply(axis=1): Used for conditional logic across rows.
pd.cut(): Converts continuous numbers (like Age) into categories (Bins).
.groupby().agg(): Performs multiple summary metrics (mean, max, sum) simultaneously.
.pivot_table(): Creates structured, cross-tabulated reports.

🤔 Interview Q&A

Tap on the questions below to reveal the answers.

You must choose .apply(axis=1) only when the transformation involves complex **conditional logic (if/else)** that depends on values across *multiple columns* in the *same row*. For simple math, Vectorization is always superior.

Binning simplifies the relationship between a continuous variable (like Salary) and the target variable. It reduces the impact of minor fluctuations and helps models focus on the categorical difference (e.g., separating "Low Income" from "High Income").

**.groupby()** outputs a summarized Series (or flattened DataFrame) with the grouping column as the index.
**pd.pivot_table()** outputs a 2D table where one column becomes the new index (rows) and another column becomes the new headers (columns), which is essential for structured reports.

The primary purpose of .agg() (Aggregation) is to perform **multiple statistical summaries** on a grouped dataset in a single operation (e.g., calculate the mean, minimum, and count of 'Sales' simultaneously after grouping by 'City').

Data Transformation

🏭 Introduction: The Data Transformation Factory

➕ Topic 1: Vectorization (Fast Column Operations)

💻 Example: Calculating Profit

🧠 Topic 2: Row Logic: `.apply(axis=1)`

💻 Example: Determining Eligibility

🔪 Topic 3: Binning & Categorization: `pd.cut()`

💻 Example: Grouping Ages

📊 Topic 4: Multi-Aggregation: `.agg()`

💻 Example: Mean and Max Sales by City

🔄 Topic 5: Data Reshaping: `.pivot_table()`

💻 Example: Pivot Table for Mean Sales

🏷️ Topic 6: Simple Substitution: `.map()`

💻 Example: Rating Conversion

📚 Module Summary

🤔 Interview Q&A

Practice Question

Loading Question...

Upload Your Own CSV

Output Console

Data Transformation

🏭 Introduction: The Data Transformation Factory

➕ Topic 1: Vectorization (Fast Column Operations)

💻 Example: Calculating Profit

🧠 Topic 2: Row Logic: .apply(axis=1)

💻 Example: Determining Eligibility

🔪 Topic 3: Binning & Categorization: pd.cut()

💻 Example: Grouping Ages

📊 Topic 4: Multi-Aggregation: .agg()

💻 Example: Mean and Max Sales by City

🔄 Topic 5: Data Reshaping: .pivot_table()

💻 Example: Pivot Table for Mean Sales

🏷️ Topic 6: Simple Substitution: .map()

💻 Example: Rating Conversion

📚 Module Summary

🤔 Interview Q&A

Practice Question

Loading Question...

Upload Your Own CSV

Output Console

🧠 Topic 2: Row Logic: `.apply(axis=1)`

🔪 Topic 3: Binning & Categorization: `pd.cut()`

📊 Topic 4: Multi-Aggregation: `.agg()`

🔄 Topic 5: Data Reshaping: `.pivot_table()`

🏷️ Topic 6: Simple Substitution: `.map()`