Merging and Joining DataFrames
Combine multiple datasets using pd.concat() and various pd.merge() joins.
🔗 Introduction: The Database Analogy
In Data Science, information is rarely in a single table. You might have one table for **User Details** and another for **User Orders**. You need methods to combine them correctly.
Pandas offers two primary methods for combination, similar to working with SQL tables: **Concatenation** and **Merging**.
📊 The Core Distinction:
1. Concatenation (pd.concat): Best for **stacking** tables that have the same structure. (Adding rows or columns without looking for common values.)
2. Merging (pd.merge): Best for **joining** tables based on a common key column (like an 'ID'). (Database-style joins.)
⬇️ Topic 1: pd.concat() - Stacking & Tiling
The **pd.concat()** function tiles DataFrames along an **axis**. This is ideal when data is split across multiple files but shares the same columns.
Concatenating Rows (axis=0)
This is the **default** behavior. It stacks data vertically (one DataFrame on top of the other).
🤝 Topic 2: pd.merge() - The Relational Join
The **pd.merge()** function is used to join tables based on the values in matching columns. This is essential for combining **User IDs** with their **Transaction History**.
🎯 Types of Joins (The 'how' parameter):
- Inner: Keeps only rows where the key exists in **BOTH** tables. (Intersection)
- Left: Keeps **ALL** rows from the left table, adding matching data from the right. (Safe way to preserve base data)
- Right: Keeps **ALL** rows from the right table, adding matching data from the left.
- Outer: Keeps **ALL** rows from both tables. Missing matches are filled with
NaN. (Union)
💻 Example: Left Merge
⚙️ Topic 3: Advanced Merging - Mismatched Keys
Sometimes the key columns have different names in the two DataFrames (e.g., 'User_ID' in the left table and 'Customer_ID' in the right). You must specify both columns separately.
💻 Example: Left_on / Right_on
📌 Topic 4: Index-Based Joining: df.join()
The **df.join()** method is a convenient shortcut used for combining DataFrames **based on their Index** (row labels). It defaults to a left join.
It's especially useful when you have created summarized data (like a total sales table) and want to attach it back to the original DataFrame based on the shared index.
💻 Example: Index Join
📚 Module Summary
- pd.concat(): Stacking (
axis=0) or Tiling (axis=1) DataFrames. - pd.merge(): Database-style joins using a common column (
on='key'). - Inner Join: Only matching records (Intersection).
- Left Join: All left records + matches from right (Preserves base data).
- pd.join(): Shortcut for merging based on the DataFrame's **Index**.
🤔 Interview Q&A
Tap on the questions below to reveal the answers.
Concatenation combines DataFrames by adding them end-to-end (stacking/tiling), ignoring the column content. **Merging** combines DataFrames by aligning them based on values in a common key column (like 'ID').
You use a **Left Join** when you need to retain *all records* from the primary (left) DataFrame, even if there is no match in the secondary table. You use an **Inner Join** only when you need *complete pairs* of data (matches in both).
You use the parameters **left_on** and **right_on** instead of the simple on parameter. Example: pd.merge(df1, df2, left_on='UserID', right_on='Customer_ID').
The danger is creating duplicate index values. If both DataFrames start their index at 0, the resulting combined DataFrame will have duplicate row labels (0, 1, 0, 1, ...). Using **ignore_index=True** fixes this by assigning a fresh, sequential index.