DSPython - Merging & Joining

🔗 Introduction: The Database Analogy

In Data Science, information is rarely in a single table. You might have one table for **User Details** and another for **User Orders**. You need methods to combine them correctly.

Pandas offers two primary methods for combination, similar to working with SQL tables: **Concatenation** and **Merging**.

📊 The Core Distinction:

1. Concatenation (pd.concat): Best for **stacking** tables that have the same structure. (Adding rows or columns without looking for common values.)
2. Merging (pd.merge): Best for **joining** tables based on a common key column (like an 'ID'). (Database-style joins.)

⬇️ Topic 1: `pd.concat()` - Stacking & Tiling

The **pd.concat()** function tiles DataFrames along an **axis**. This is ideal when data is split across multiple files but shares the same columns.

Concatenating Rows (`axis=0`)

This is the **default** behavior. It stacks data vertically (one DataFrame on top of the other).

# Stacks df2 below df1 (adding rows)
result = pd.concat([df1, df2], axis=0, ignore_index=True)
                
# ignore_index=True resets the row numbers (index)

🤝 Topic 2: `pd.merge()` - The Relational Join

The **pd.merge()** function is used to join tables based on the values in matching columns. This is essential for combining **User IDs** with their **Transaction History**.

🎯 Types of Joins (The 'how' parameter):

Inner: Keeps only rows where the key exists in **BOTH** tables. (Intersection)
Left: Keeps **ALL** rows from the left table, adding matching data from the right. (Safe way to preserve base data)
Right: Keeps **ALL** rows from the right table, adding matching data from the left.
Outer: Keeps **ALL** rows from both tables. Missing matches are filled with NaN. (Union)

💻 Example: Left Merge# Find all customer details and their corresponding orders
result = pd.merge(customers_df, orders_df, on='customer_id', how='left')
            
# If a customer has no orders, the 'orders' column will show NaN.

⚙️ Topic 3: Advanced Merging - Mismatched Keys

Sometimes the key columns have different names in the two DataFrames (e.g., 'User_ID' in the left table and 'Customer_ID' in the right). You must specify both columns separately.

💻 Example: Left_on / Right_ondf_users = pd.DataFrame({'UserID': [1, 2]})
            
df_profiles = pd.DataFrame({'Customer': [2, 1], 'Age': [30, 25]})
            
# Join based on UserID from left and Customer from right
result = pd.merge(df_users, df_profiles, left_on='UserID', right_on='Customer')
            
print(result)

📌 Topic 4: Index-Based Joining: `df.join()`

The **df.join()** method is a convenient shortcut used for combining DataFrames **based on their Index** (row labels). It defaults to a left join.

It's especially useful when you have created summarized data (like a total sales table) and want to attach it back to the original DataFrame based on the shared index.

💻 Example: Index Join# df_A has Index 0, 1, 2. df_B has Index 1, 2, 3
df_A = pd.DataFrame({'A': [1, 2, 3]})
            
df_B = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])
            
# Join df_B onto df_A based on matching row index
result = df_A.join(df_B, how='left')
            
print(result)
            
# Output: B column will have NaN for Index 0, since df_B started at Index 1.

📚 Module Summary

pd.concat(): Stacking (axis=0) or Tiling (axis=1) DataFrames.
pd.merge(): Database-style joins using a common column (on='key').
Inner Join: Only matching records (Intersection).
Left Join: All left records + matches from right (Preserves base data).
pd.join(): Shortcut for merging based on the DataFrame's **Index**.

🤔 Interview Q&A

Tap on the questions below to reveal the answers.

Concatenation combines DataFrames by adding them end-to-end (stacking/tiling), ignoring the column content. **Merging** combines DataFrames by aligning them based on values in a common key column (like 'ID').

You use a **Left Join** when you need to retain *all records* from the primary (left) DataFrame, even if there is no match in the secondary table. You use an **Inner Join** only when you need *complete pairs* of data (matches in both).

You use the parameters **left_on** and **right_on** instead of the simple on parameter. Example: pd.merge(df1, df2, left_on='UserID', right_on='Customer_ID').

The danger is creating duplicate index values. If both DataFrames start their index at 0, the resulting combined DataFrame will have duplicate row labels (0, 1, 0, 1, ...). Using **ignore_index=True** fixes this by assigning a fresh, sequential index.

Merging and Joining DataFrames

🔗 Introduction: The Database Analogy

📊 The Core Distinction:

⬇️ Topic 1: `pd.concat()` - Stacking & Tiling

Concatenating Rows (`axis=0`)

🤝 Topic 2: `pd.merge()` - The Relational Join

🎯 Types of Joins (The 'how' parameter):

💻 Example: Left Merge

⚙️ Topic 3: Advanced Merging - Mismatched Keys

💻 Example: Left_on / Right_on

📌 Topic 4: Index-Based Joining: `df.join()`

💻 Example: Index Join

📚 Module Summary

🤔 Interview Q&A

Practice Question

Loading Question...

Upload Your Own CSV

Output Console

Merging and Joining DataFrames

🔗 Introduction: The Database Analogy

📊 The Core Distinction:

⬇️ Topic 1: pd.concat() - Stacking & Tiling

Concatenating Rows (axis=0)

🤝 Topic 2: pd.merge() - The Relational Join

🎯 Types of Joins (The 'how' parameter):

💻 Example: Left Merge

⚙️ Topic 3: Advanced Merging - Mismatched Keys

💻 Example: Left_on / Right_on

📌 Topic 4: Index-Based Joining: df.join()

💻 Example: Index Join

📚 Module Summary

🤔 Interview Q&A

Practice Question

Loading Question...

Upload Your Own CSV

Output Console

⬇️ Topic 1: `pd.concat()` - Stacking & Tiling

Concatenating Rows (`axis=0`)

🤝 Topic 2: `pd.merge()` - The Relational Join

📌 Topic 4: Index-Based Joining: `df.join()`