Merging and Joining DataFrames
Combine and reshape data using keys and indices with Pandas.
π€ Introduction: The Relational Data Model
In Data Science, information is often split across multiple tables to avoid redundancy (Normalization). For instance, customer details are separate from their monthly transactions. To get a complete view, you must link these tables.
**Merging** and **Joining** refer to combining DataFrames horizontally based on common column values (known as **keys**), similar to SQL operations.
**Concatenation** is different; it is primarily used for stacking DataFrames vertically (appending rows) or side-by-side (appending columns) without key-based alignment.
[Image of SQL join venn diagram]π― Topic 1: The Core Merge Function (`pd.merge`)
The `pd.merge()` function is the primary tool for combining DataFrames based on values in shared columns. It offers full control over the relational join operation.
π Key `pd.merge()` Arguments:
on='key_col',
how='inner',
suffixes=('_x', '_y'))
**1. `on`:** Single column name or a list of column names (e.g., `['ID', 'Date']`) to join on.
**2. `left_on`/`right_on`:** Used if the joining column names are different in the two DataFrames.
**3. `how`:** Specifies the type of join (Inner, Left, Right, Outer).
π» Example: Inner Join
import pandas as pddf_A = pd.DataFrame({'ID': [1, 2], 'Dept': ['HR', 'IT']})
df_B = pd.DataFrame({'ID': [2, 3], 'Salary': [70000, 80000]})
inner_result = pd.merge(df_A, df_B, on='ID', how='inner')
print(inner_result)
π Topic 2: Understanding Join Types (`how`)
Choosing the correct join type is crucial as it determines which records are included in the final result and where missing values (`NaN`) are introduced.
π The Four Merging Styles:
Keeps the **intersection** of keys. Only matched rows from both sides are kept. (Default setting if 'how' is omitted).
Keeps **all rows from the left** DataFrame. Missing matches on the right side are filled with `NaN`.
Keeps **all rows from the right** DataFrame. Missing matches on the left side are filled with `NaN`.
Keeps the **union** of keys. All rows from both DataFrames are kept, filling unmatched columns with `NaN`.
ποΈ Topic 3: Index-Based Joining (`df.join()`)
The `DataFrame.join()` method is a simplified tool that is primarily designed for joining data based on the DataFrame's **index**. It's often cleaner and faster for cases where the index is the key.
Key `df.join()` Features:
- **Join Key:** Defaults to joining the *right* DataFrame to the *left* DataFrame's **index**.
- **`on` argument:** Allows you to specify a column in the *left* DataFrame to join to the *right* DataFrame's index.
- **`lsuffix`/`rsuffix`:** Mandatory if non-index/key columns share the same name.
π» Example: Joining on Index
df_base = pd.DataFrame({'Value': [10, 20, 30]}, index=['A', 'B', 'C'])df_meta = pd.DataFrame({'Desc': ['Good', 'Bad']}, index=['A', 'B'])
joined_result = df_base.join(df_meta, how='left')
print(joined_result)
βοΈ Topic 4: Concatenation (`pd.concat()`)
Concatenation is used when combining DataFrames that share common columns (for `axis=0`) or common indices (for `axis=1`), without performing a relational check for matching key values across rows.
The `axis` Argument Controls Direction:
- **`axis=0` (Default):** **Rows** are stacked vertically (e.g., combining Q1 data with Q2 data). Columns are aligned by name.
- **`axis=1`:** **Columns** are lined up horizontally (e.g., adding a new feature). Rows are aligned by index.
- **`join` argument:** Defaults to `join='outer'` (union of columns/indices). Use `join='inner'` to keep only common columns/indices.
π» Example: Vertical Concatenation
df_Q1 = pd.DataFrame({'A': [1], 'B': [2]})df_Q2 = pd.DataFrame({'A': [3], 'B': [4]})
stacked = pd.concat([df_Q1, df_Q2], ignore_index=True)
print(stacked)
π Module Summary
- `pd.merge()`: Primary method for relational joins on specific **key columns**.
- `how` Argument: Defines the join logic (`inner`, `left`, `right`, `outer`).
- `df.join()`: Simplifies joining the right DataFrame to the left DataFrame's **index**.
- `pd.concat()`: Used for stacking rows (`axis=0`) or columns (`axis=1`), aligning based on index/column names, not keys.
- Conflicting Names: `pd.merge()` adds `_x` and `_y` suffixes by default for non-key columns with the same name.
π€ Interview Q&A
Tap on the questions below to reveal the answers.
**Merging (`pd.merge`)** combines based on key values (like `CustomerID`) to link related data horizontally. **Concatenation (`pd.concat`)** combines by position, stacking rows or aligning columns side-by-side without checking key relationships.
An **Outer Join** (how='outer') should be used when you need to keep **all** records from both DataFrames, regardless of whether a match exists. This is useful for identifying non-matched records or getting a full census of data.
You use the **`left_on`** and **`right_on`** arguments in `pd.merge()`. For example, if the key is `EmpID` in the left table and `WorkerID` in the right table: pd.merge(df_L, df_R, left_on='EmpID', right_on='WorkerID').
Pandas automatically handles the conflict by appending suffixes, usually _x for the left DataFrame and _y for the right DataFrame (e.g., a column named `Age` becomes `Age_x` and `Age_y`). These suffixes can be customized using the `suffixes` parameter.