DSPython Logo DSPython

Merging and Joining DataFrames

Combine multiple datasets using pd.concat() and various pd.merge() joins.

Data Science Intermediate 45 min

🔗 Introduction: The Database Analogy

In Data Science, information is rarely in a single table. You might have one table for **User Details** and another for **User Orders**. You need methods to combine them correctly.

Pandas offers two primary methods for combination, similar to working with SQL tables: **Concatenation** and **Merging**.

📊 The Core Distinction:

1. Concatenation (pd.concat): Best for **stacking** tables that have the same structure. (Adding rows or columns without looking for common values.)
2. Merging (pd.merge): Best for **joining** tables based on a common key column (like an 'ID'). (Database-style joins.)

⬇️ Topic 1: pd.concat() - Stacking & Tiling

The **pd.concat()** function tiles DataFrames along an **axis**. This is ideal when data is split across multiple files but shares the same columns.

Concatenating Rows (axis=0)

This is the **default** behavior. It stacks data vertically (one DataFrame on top of the other).

# Stacks df2 below df1 (adding rows)
result = pd.concat([df1, df2], axis=0, ignore_index=True)
# ignore_index=True resets the row numbers (index)

🤝 Topic 2: pd.merge() - The Relational Join

The **pd.merge()** function is used to join tables based on the values in matching columns. This is essential for combining **User IDs** with their **Transaction History**.

🎯 Types of Joins (The 'how' parameter):

  • Inner: Keeps only rows where the key exists in **BOTH** tables. (Intersection)
  • Left: Keeps **ALL** rows from the left table, adding matching data from the right. (Safe way to preserve base data)
  • Right: Keeps **ALL** rows from the right table, adding matching data from the left.
  • Outer: Keeps **ALL** rows from both tables. Missing matches are filled with NaN. (Union)

💻 Example: Left Merge

# Find all customer details and their corresponding orders
result = pd.merge(customers_df, orders_df, on='customer_id', how='left')
# If a customer has no orders, the 'orders' column will show NaN.

⚙️ Topic 3: Advanced Merging - Mismatched Keys

Sometimes the key columns have different names in the two DataFrames (e.g., 'User_ID' in the left table and 'Customer_ID' in the right). You must specify both columns separately.

💻 Example: Left_on / Right_on

df_users = pd.DataFrame({'UserID': [1, 2]})
df_profiles = pd.DataFrame({'Customer': [2, 1], 'Age': [30, 25]})

# Join based on UserID from left and Customer from right
result = pd.merge(df_users, df_profiles, left_on='UserID', right_on='Customer')
print(result)

📌 Topic 4: Index-Based Joining: df.join()

The **df.join()** method is a convenient shortcut used for combining DataFrames **based on their Index** (row labels). It defaults to a left join.

It's especially useful when you have created summarized data (like a total sales table) and want to attach it back to the original DataFrame based on the shared index.

💻 Example: Index Join

# df_A has Index 0, 1, 2. df_B has Index 1, 2, 3
df_A = pd.DataFrame({'A': [1, 2, 3]})
df_B = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

# Join df_B onto df_A based on matching row index
result = df_A.join(df_B, how='left')
print(result)
# Output: B column will have NaN for Index 0, since df_B started at Index 1.

📚 Module Summary

  • pd.concat(): Stacking (axis=0) or Tiling (axis=1) DataFrames.
  • pd.merge(): Database-style joins using a common column (on='key').
  • Inner Join: Only matching records (Intersection).
  • Left Join: All left records + matches from right (Preserves base data).
  • pd.join(): Shortcut for merging based on the DataFrame's **Index**.

🤔 Interview Q&A

Tap on the questions below to reveal the answers.

Concatenation combines DataFrames by adding them end-to-end (stacking/tiling), ignoring the column content. **Merging** combines DataFrames by aligning them based on values in a common key column (like 'ID').

You use a **Left Join** when you need to retain *all records* from the primary (left) DataFrame, even if there is no match in the secondary table. You use an **Inner Join** only when you need *complete pairs* of data (matches in both).

You use the parameters **left_on** and **right_on** instead of the simple on parameter. Example: pd.merge(df1, df2, left_on='UserID', right_on='Customer_ID').

The danger is creating duplicate index values. If both DataFrames start their index at 0, the resulting combined DataFrame will have duplicate row labels (0, 1, 0, 1, ...). Using **ignore_index=True** fixes this by assigning a fresh, sequential index.