DSPython Logo DSPython

Merging and Joining DataFrames

Combine and reshape data using keys and indices with Pandas.

Data Science Core Intermediate 60 min

🀝 Introduction: The Relational Data Model

In Data Science, information is often split across multiple tables to avoid redundancy (Normalization). For instance, customer details are separate from their monthly transactions. To get a complete view, you must link these tables.

**Merging** and **Joining** refer to combining DataFrames horizontally based on common column values (known as **keys**), similar to SQL operations.

**Concatenation** is different; it is primarily used for stacking DataFrames vertically (appending rows) or side-by-side (appending columns) without key-based alignment.

[Image of SQL join venn diagram]

🎯 Topic 1: The Core Merge Function (`pd.merge`)

The `pd.merge()` function is the primary tool for combining DataFrames based on values in shared columns. It offers full control over the relational join operation.

πŸ“ Key `pd.merge()` Arguments:

pd.merge(left_df, right_df,
on='key_col',
how='inner',
suffixes=('_x', '_y'))

**1. `on`:** Single column name or a list of column names (e.g., `['ID', 'Date']`) to join on.
**2. `left_on`/`right_on`:** Used if the joining column names are different in the two DataFrames.
**3. `how`:** Specifies the type of join (Inner, Left, Right, Outer).

πŸ’» Example: Inner Join

import pandas as pd
df_A = pd.DataFrame({'ID': [1, 2], 'Dept': ['HR', 'IT']})
df_B = pd.DataFrame({'ID': [2, 3], 'Salary': [70000, 80000]})

inner_result = pd.merge(df_A, df_B, on='ID', how='inner')
print(inner_result)
# Output: Only ID 2 matches, so only that row is kept.
# ID Dept Salary
# 0 2 IT 70000

πŸ”— Topic 2: Understanding Join Types (`how`)

Choosing the correct join type is crucial as it determines which records are included in the final result and where missing values (`NaN`) are introduced.

πŸ“˜ The Four Merging Styles:

Inner Join (`inner`)

Keeps the **intersection** of keys. Only matched rows from both sides are kept. (Default setting if 'how' is omitted).

Left Join (`left`)

Keeps **all rows from the left** DataFrame. Missing matches on the right side are filled with `NaN`.

Right Join (`right`)

Keeps **all rows from the right** DataFrame. Missing matches on the left side are filled with `NaN`.

Outer Join (`outer`)

Keeps the **union** of keys. All rows from both DataFrames are kept, filling unmatched columns with `NaN`.

πŸ—‚οΈ Topic 3: Index-Based Joining (`df.join()`)

The `DataFrame.join()` method is a simplified tool that is primarily designed for joining data based on the DataFrame's **index**. It's often cleaner and faster for cases where the index is the key.

Key `df.join()` Features:

  • **Join Key:** Defaults to joining the *right* DataFrame to the *left* DataFrame's **index**.
  • **`on` argument:** Allows you to specify a column in the *left* DataFrame to join to the *right* DataFrame's index.
  • **`lsuffix`/`rsuffix`:** Mandatory if non-index/key columns share the same name.

πŸ’» Example: Joining on Index

df_base = pd.DataFrame({'Value': [10, 20, 30]}, index=['A', 'B', 'C'])
df_meta = pd.DataFrame({'Desc': ['Good', 'Bad']}, index=['A', 'B'])

joined_result = df_base.join(df_meta, how='left')
print(joined_result)

↕️ Topic 4: Concatenation (`pd.concat()`)

Concatenation is used when combining DataFrames that share common columns (for `axis=0`) or common indices (for `axis=1`), without performing a relational check for matching key values across rows.

The `axis` Argument Controls Direction:

  • **`axis=0` (Default):** **Rows** are stacked vertically (e.g., combining Q1 data with Q2 data). Columns are aligned by name.
  • **`axis=1`:** **Columns** are lined up horizontally (e.g., adding a new feature). Rows are aligned by index.
  • **`join` argument:** Defaults to `join='outer'` (union of columns/indices). Use `join='inner'` to keep only common columns/indices.

πŸ’» Example: Vertical Concatenation

df_Q1 = pd.DataFrame({'A': [1], 'B': [2]})
df_Q2 = pd.DataFrame({'A': [3], 'B': [4]})

stacked = pd.concat([df_Q1, df_Q2], ignore_index=True)
print(stacked)
# Output:
# A B
# 0 1 2
# 1 3 4

πŸ“š Module Summary

  • `pd.merge()`: Primary method for relational joins on specific **key columns**.
  • `how` Argument: Defines the join logic (`inner`, `left`, `right`, `outer`).
  • `df.join()`: Simplifies joining the right DataFrame to the left DataFrame's **index**.
  • `pd.concat()`: Used for stacking rows (`axis=0`) or columns (`axis=1`), aligning based on index/column names, not keys.
  • Conflicting Names: `pd.merge()` adds `_x` and `_y` suffixes by default for non-key columns with the same name.

πŸ€” Interview Q&A

Tap on the questions below to reveal the answers.

**Merging (`pd.merge`)** combines based on key values (like `CustomerID`) to link related data horizontally. **Concatenation (`pd.concat`)** combines by position, stacking rows or aligning columns side-by-side without checking key relationships.

An **Outer Join** (how='outer') should be used when you need to keep **all** records from both DataFrames, regardless of whether a match exists. This is useful for identifying non-matched records or getting a full census of data.

You use the **`left_on`** and **`right_on`** arguments in `pd.merge()`. For example, if the key is `EmpID` in the left table and `WorkerID` in the right table: pd.merge(df_L, df_R, left_on='EmpID', right_on='WorkerID').

Pandas automatically handles the conflict by appending suffixes, usually _x for the left DataFrame and _y for the right DataFrame (e.g., a column named `Age` becomes `Age_x` and `Age_y`). These suffixes can be customized using the `suffixes` parameter.

πŸ€–
DSPython AI Assistant βœ–
πŸ‘‹ Hi! I’m your AI assistant. Paste your code here, I will find bugs for you.