Skip to main content

Merge, Join, and Concat

Overview

pandas provides several methods to combine DataFrames:

  • merge(): SQL-style joins based on common columns/indices
  • join(): Join on indices (or column to index)
  • concat(): Stack DataFrames vertically or horizontally
  • append(): Add rows (deprecated, use concat)

merge() - SQL-Style Joins

Inner Join (Default)

Keep only rows with matching keys in both DataFrames:

Inner join - intersection
import pandas as pd

left = pd.DataFrame({
'key': ['A', 'B', 'C'],
'left_value': [1, 2, 3]
})

right = pd.DataFrame({
'key': ['A', 'B', 'D'],
'right_value': [4, 5, 6]
})

# Inner join (only A and B match)
result = pd.merge(left, right, on='key', how='inner')
# key left_value right_value
# 0 A 1 4
# 1 B 2 5

# Or use DataFrame method
result = left.merge(right, on='key', how='inner')
info

Inner join keeps only matching rows. This is the default behavior when how is not specified.

Left Join

Keep all rows from left DataFrame:

Left join - keep all left rows
result = pd.merge(left, right, on='key', how='left')
# key left_value right_value
# 0 A 1 4.0
# 1 B 2 5.0
# 2 C 3 NaN # No match in right, filled with NaN

Right Join

Keep all rows from right DataFrame:

Right join - keep all right rows
result = pd.merge(left, right, on='key', how='right')
# key left_value right_value
# 0 A 1.0 4
# 1 B 2.0 5
# 2 D NaN 6 # No match in left

Outer Join

Keep all rows from both DataFrames:

Outer join - union
result = pd.merge(left, right, on='key', how='outer')
# key left_value right_value
# 0 A 1.0 4.0
# 1 B 2.0 5.0
# 2 C 3.0 NaN
# 3 D NaN 6.0

Join Type Comparison

All join types visualized
left = pd.DataFrame({
'key': ['A', 'B', 'C'],
'val': [1, 2, 3]
})
right = pd.DataFrame({
'key': ['B', 'C', 'D'],
'val': [4, 5, 6]
})

# Inner: B, C (intersection)
# Left: A, B, C (all from left)
# Right: B, C, D (all from right)
# Outer: A, B, C, D (union)
warning

Be careful with join types:

  • Inner: May lose data (only matches)
  • Left: Preserves left data, adds right where possible
  • Right: Preserves right data, adds left where possible
  • Outer: Keeps everything, but creates many NaN values

Merge on Different Columns

Different Column Names

Merge when columns have different names
left = pd.DataFrame({
'employee_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})

right = pd.DataFrame({
'emp_id': [1, 2, 4],
'salary': [50000, 60000, 55000]
})

# Specify which columns to join on
result = pd.merge(
left,
right,
left_on='employee_id',
right_on='emp_id',
how='inner'
)
# employee_id name emp_id salary
# 0 1 Alice 1 50000
# 1 2 Bob 2 60000

Multiple Columns

Merge on multiple columns
left = pd.DataFrame({
'year': [2023, 2023, 2024],
'quarter': ['Q1', 'Q2', 'Q1'],
'sales': [100, 120, 110]
})

right = pd.DataFrame({
'year': [2023, 2023, 2024],
'quarter': ['Q1', 'Q2', 'Q2'],
'costs': [80, 90, 85]
})

# Join on both year and quarter
result = pd.merge(left, right, on=['year', 'quarter'], how='inner')
# year quarter sales costs
# 0 2023 Q1 100 80
# 1 2023 Q2 120 90

Merge with Index

Merge on index
left = pd.DataFrame({
'value': [1, 2, 3]
}, index=['A', 'B', 'C'])

right = pd.DataFrame({
'value': [4, 5, 6]
}, index=['A', 'B', 'D'])

# Merge on indices
result = pd.merge(
left,
right,
left_index=True,
right_index=True,
how='inner'
)
# value_x value_y
# A 1 4
# B 2 5

# Or use suffixes to rename columns
result = pd.merge(
left,
right,
left_index=True,
right_index=True,
how='inner',
suffixes=('_left', '_right')
)
# value_left value_right
# A 1 4
# B 2 5

Handling Duplicate Keys

One-to-Many Merge

One-to-many relationship
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
'customer_id': [1, 1, 2, 3, 3],
'order_id': [101, 102, 103, 104, 105],
'amount': [100, 150, 200, 120, 180]
})

# One customer can have many orders
result = pd.merge(customers, orders, on='customer_id', how='left')
# customer_id name order_id amount
# 0 1 Alice 101.0 100.0
# 1 1 Alice 102.0 150.0
# 2 2 Bob 103.0 200.0
# 3 3 Charlie 104.0 120.0
# 4 3 Charlie 105.0 180.0

Many-to-Many Merge

Many-to-many creates cartesian product
students = pd.DataFrame({
'class': ['Math', 'Math', 'Science'],
'student': ['Alice', 'Bob', 'Charlie']
})

classes = pd.DataFrame({
'class': ['Math', 'Math', 'Science'],
'room': ['101', '102', '201']
})

# Each Math student matches both Math rooms (2x2 = 4 rows)
result = pd.merge(students, classes, on='class')
# class student room
# 0 Math Alice 101
# 1 Math Alice 102
# 2 Math Bob 101
# 3 Math Bob 102
# 4 Science Charlie 201
warning

Many-to-many merges can explode the number of rows. Check if this is expected or indicates a data issue.

Indicator Column

Track which DataFrame each row came from:

Add merge indicator
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'val': [1, 2, 3]})
right = pd.DataFrame({'key': ['B', 'C', 'D'], 'val': [4, 5, 6]})

result = pd.merge(left, right, on='key', how='outer', indicator=True)
# key val_x val_y _merge
# 0 A 1.0 NaN left_only
# 1 B 2.0 4.0 both
# 2 C 3.0 5.0 both
# 3 D NaN 6.0 right_only

# Custom indicator name
result = pd.merge(left, right, on='key', how='outer', indicator='source')

join() Method

Simpler syntax for index-based joins:

Join on indices
left = pd.DataFrame({
'A': [1, 2, 3]
}, index=['a', 'b', 'c'])

right = pd.DataFrame({
'B': [4, 5, 6]
}, index=['a', 'b', 'd'])

# Join (left join by default)
result = left.join(right, how='inner')
# A B
# a 1 4
# b 2 5

# Outer join
result = left.join(right, how='outer')
# A B
# a 1.0 4.0
# b 2.0 5.0
# c 3.0 NaN
# d NaN 6.0

Join Multiple DataFrames

Join multiple DataFrames at once
df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['a', 'b'])
df3 = pd.DataFrame({'C': [5, 6]}, index=['a', 'b'])

# Join all at once
result = df1.join([df2, df3])
# A B C
# a 1 3 5
# b 2 4 6

Join Column to Index

Join DataFrame column to another's index
left = pd.DataFrame({
'key': ['a', 'b', 'c'],
'value': [1, 2, 3]
})

right = pd.DataFrame({
'data': [4, 5, 6]
}, index=['a', 'b', 'd'])

# Join left's 'key' column to right's index
result = left.join(right, on='key')
# key value data
# 0 a 1 4.0
# 1 b 2 5.0
# 2 c 3 NaN

concat() - Concatenation

Vertical Concatenation (Stack Rows)

Concatenate along rows (axis=0)
df1 = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
})

df2 = pd.DataFrame({
'A': [5, 6],
'B': [7, 8]
})

# Stack vertically (default axis=0)
result = pd.concat([df1, df2])
# A B
# 0 1 3
# 1 2 4
# 0 5 7 # Index repeats!
# 1 6 8

# Reset index
result = pd.concat([df1, df2], ignore_index=True)
# A B
# 0 1 3
# 1 2 4
# 2 5 7
# 3 6 8

Horizontal Concatenation (Stack Columns)

Concatenate along columns (axis=1)
df1 = pd.DataFrame({
'A': [1, 2, 3]
})

df2 = pd.DataFrame({
'B': [4, 5, 6]
})

# Stack horizontally
result = pd.concat([df1, df2], axis=1)
# A B
# 0 1 4
# 1 2 5
# 2 3 6

Concat with Different Columns

Handle mismatched columns
df1 = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
})

df2 = pd.DataFrame({
'B': [5, 6],
'C': [7, 8]
})

# Union of columns (default)
result = pd.concat([df1, df2], ignore_index=True)
# A B C
# 0 1.0 3 NaN
# 1 2.0 4 NaN
# 2 NaN 5 7.0
# 3 NaN 6 8.0

# Intersection of columns only
result = pd.concat([df1, df2], join='inner', ignore_index=True)
# B
# 0 3
# 1 4
# 2 5
# 3 6

Concat with Keys

Add multi-level index to track source:

Track source DataFrame with keys
df1 = pd.DataFrame({'value': [1, 2]})
df2 = pd.DataFrame({'value': [3, 4]})

result = pd.concat([df1, df2], keys=['df1', 'df2'])
# value
# df1 0 1
# 1 2
# df2 0 3
# 1 4

# Reset to make keys a column
result = result.reset_index(level=0).rename(columns={'level_0': 'source'})
# source value
# 0 df1 1
# 1 df1 2
# 0 df2 3
# 1 df2 4

Combining Operations

Merge Multiple DataFrames

Chain multiple merges
df1 = pd.DataFrame({'key': ['A', 'B'], 'val1': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'B'], 'val2': [3, 4]})
df3 = pd.DataFrame({'key': ['A', 'B'], 'val3': [5, 6]})

# Chain merges
result = (df1
.merge(df2, on='key')
.merge(df3, on='key')
)
# key val1 val2 val3
# 0 A 1 3 5
# 1 B 2 4 6

# Or use reduce
from functools import reduce

dfs = [df1, df2, df3]
result = reduce(lambda left, right: pd.merge(left, right, on='key'), dfs)

Merge Then Concat

Combine merge and concat
# Merge data for 2023
df_2023_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 200]})
df_2023_costs = pd.DataFrame({'product': ['A', 'B'], 'costs': [80, 150]})
df_2023 = pd.merge(df_2023_sales, df_2023_costs, on='product')
df_2023['year'] = 2023

# Merge data for 2024
df_2024_sales = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 220]})
df_2024_costs = pd.DataFrame({'product': ['A', 'B'], 'costs': [85, 160]})
df_2024 = pd.merge(df_2024_sales, df_2024_costs, on='product')
df_2024['year'] = 2024

# Concat both years
result = pd.concat([df_2023, df_2024], ignore_index=True)

Common Patterns

Add Lookup Data

Enrich data with lookup table
transactions = pd.DataFrame({
'product_id': [1, 2, 1, 3],
'quantity': [10, 5, 8, 12]
})

product_info = pd.DataFrame({
'product_id': [1, 2, 3],
'name': ['Widget', 'Gadget', 'Tool'],
'price': [10.0, 20.0, 15.0]
})

# Add product info to transactions
result = pd.merge(transactions, product_info, on='product_id', how='left')
result['total'] = result['quantity'] * result['price']
# product_id quantity name price total
# 0 1 10 Widget 10.0 100.0
# 1 2 5 Gadget 20.0 100.0
# 2 1 8 Widget 10.0 80.0
# 3 3 12 Tool 15.0 180.0

Combine Regional Data

Combine data from multiple sources
north = pd.DataFrame({
'product': ['A', 'B'],
'sales': [100, 200],
'region': 'North'
})

south = pd.DataFrame({
'product': ['A', 'B'],
'sales': [150, 180],
'region': 'South'
})

# Combine all regions
all_regions = pd.concat([north, south], ignore_index=True)
# product sales region
# 0 A 100 North
# 1 B 200 North
# 2 A 150 South
# 3 B 180 South

Update DataFrame

Update values from another DataFrame
main = pd.DataFrame({
'id': [1, 2, 3],
'value': [10, 20, 30]
})

updates = pd.DataFrame({
'id': [2, 3],
'value': [25, 35] # New values
})

# Update by merging and combining
result = main.merge(updates, on='id', how='left', suffixes=('', '_new'))
result['value'] = result['value_new'].fillna(result['value'])
result = result.drop('value_new', axis=1)
# id value
# 0 1 10.0
# 1 2 25.0 # Updated
# 2 3 35.0 # Updated

Performance Tips

Merge Performance

Optimize merge operations
# Index merge is faster
left = df1.set_index('key')
right = df2.set_index('key')
result = left.join(right) # Faster than merge on column

# Sort before merge for large DataFrames
left = left.sort_values('key')
right = right.sort_values('key')
result = pd.merge(left, right, on='key')

# Use categorical for repeated merge keys
df['key'] = df['key'].astype('category')
success

For repeated merges on the same column, convert it to categorical dtype or set it as the index for better performance.

Common Mistakes

Duplicate Column Names

Handle duplicate column names
left = pd.DataFrame({'key': ['A'], 'value': [1]})
right = pd.DataFrame({'key': ['A'], 'value': [2]})

# Creates value_x and value_y (confusing!)
result = pd.merge(left, right, on='key')
# key value_x value_y
# 0 A 1 2

# Better: use descriptive suffixes
result = pd.merge(left, right, on='key', suffixes=('_left', '_right'))
# key value_left value_right
# 0 A 1 2

# Or rename before merge
left = left.rename(columns={'value': 'left_value'})
right = right.rename(columns={'value': 'right_value'})
result = pd.merge(left, right, on='key')

Forgetting ignore_index

Reset index after concat
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4]})

# Index repeats (confusing)
result = pd.concat([df1, df2])
# A
# 0 1
# 1 2
# 0 3 # Index 0 again!
# 1 4

# Better: reset index
result = pd.concat([df1, df2], ignore_index=True)
# A
# 0 1
# 1 2
# 2 3
# 3 4

Wrong Join Type

Choose correct join type
left = pd.DataFrame({'key': ['A', 'B', 'C'], 'val': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'B'], 'val': [4, 5]})

# Inner join loses 'C'
inner = pd.merge(left, right, on='key', how='inner') # Only A, B

# Left join keeps all left data
left_join = pd.merge(left, right, on='key', how='left') # A, B, C

Quick Reference

Merge (SQL-style joins):

pd.merge(left, right, on='key', how='inner')  # Inner join
pd.merge(left, right, on='key', how='left') # Left join
pd.merge(left, right, on='key', how='right') # Right join
pd.merge(left, right, on='key', how='outer') # Outer join
pd.merge(left, right, left_on='k1', right_on='k2') # Different names
pd.merge(left, right, left_index=True, right_index=True) # On index

Join (index-based):

left.join(right, how='inner')
left.join(right, how='left')
left.join([df2, df3]) # Join multiple
left.join(right, on='key') # Column to index

Concat (stack):

pd.concat([df1, df2])  # Stack rows (axis=0)
pd.concat([df1, df2], axis=1) # Stack columns
pd.concat([df1, df2], ignore_index=True) # Reset index
pd.concat([df1, df2], keys=['a', 'b']) # Add multi-level index
pd.concat([df1, df2], join='inner') # Only common columns

Common patterns:

# Merge multiple DataFrames
df1.merge(df2, on='key').merge(df3, on='key')

# Add lookup data
transactions.merge(product_info, on='product_id', how='left')

# Combine and track source
pd.concat([df1, df2], keys=['source1', 'source2'])

# Update values
main.merge(updates, on='id', how='left', suffixes=('', '_new'))

Tips:

  • Use merge() for column-based joins
  • Use join() for index-based joins
  • Use concat() for stacking DataFrames
  • Always check result size to catch unexpected many-to-many joins
  • Use indicator=True to track merge sources