GroupBy Operations

Overview

GroupBy implements the split-apply-combine pattern:

Split: Divide data into groups based on criteria
Apply: Apply a function to each group independently
Combine: Combine results into a data structure

This is one of the most powerful features in pandas.

Basic GroupBy

Single Column Grouping

Group by single column
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'value': [10, 20, 15, 25, 12, 30]
})

# Group by category and sum
result = df.groupby('category')['value'].sum()
# category
# A    37
# B    75

# Group and aggregate entire DataFrame
result = df.groupby('category').sum()
#          value
# category      
# A           37
# B           75

Multiple Column Grouping

Group by multiple columns
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'sales': [100, 150, 200, 250, 120, 280]
})

# Group by category and region
result = df.groupby(['category', 'region'])['sales'].sum()
# category  region
# A         East      220
#           West      150
# B         East      200
#           West      530

# Reset index to get DataFrame
result = result.reset_index()
#   category region  sales
# 0        A   East    220
# 1        A   West    150
# 2        B   East    200
# 3        B   West    530

info

GroupBy with multiple columns creates a MultiIndex. Use reset_index() to convert it back to regular columns.

Common Aggregations

Built-in Aggregations

Standard aggregation methods
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 15, 25]
})

# Sum
df.groupby('category')['value'].sum()

# Mean
df.groupby('category')['value'].mean()
# category
# A    15.0
# B    20.0

# Count
df.groupby('category')['value'].count()

# Min/Max
df.groupby('category')['value'].min()
df.groupby('category')['value'].max()

# Standard deviation
df.groupby('category')['value'].std()

# Multiple statistics
df.groupby('category')['value'].describe()

agg() - Multiple Aggregations

Apply multiple aggregations
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'sales': [100, 150, 200, 250],
    'quantity': [10, 15, 20, 25]
})

# Multiple aggregations on one column
result = df.groupby('category')['sales'].agg(['sum', 'mean', 'count'])
#          sum   mean  count
# category                  
# A        250  125.0      2
# B        450  225.0      2

# Different aggregations per column
result = df.groupby('category').agg({
    'sales': ['sum', 'mean'],
    'quantity': ['min', 'max']
})
#          sales       quantity    
#            sum  mean      min max
# category                         
# A          250 125.0       10  15
# B          450 225.0       20  25

# Custom names
result = df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    total_qty=('quantity', 'sum')
)
#          total_sales  avg_sales  total_qty
# category                                   
# A                250      125.0         25
# B                450      225.0         45

success

Use named aggregations (pandas 0.25+) for cleaner column names: agg(name=('column', 'function'))

GroupBy Selection

Select Columns After Grouping

Select specific columns
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'product': ['X', 'Y', 'X', 'Y'],
    'sales': [100, 150, 200, 250],
    'profit': [20, 30, 40, 50]
})

# Group by category, aggregate sales only
df.groupby('category')['sales'].sum()

# Group by category, aggregate multiple columns
df.groupby('category')[['sales', 'profit']].sum()
#          sales  profit
# category              
# A          250      50
# B          450      90

# All columns except grouping column
df.groupby('category').sum()

Iterate Through Groups

Loop through groups
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 15, 25]
})

# Iterate through groups
for name, group in df.groupby('category'):
    print(f"Category: {name}")
    print(group)
    print()

# Category: A
#   category  value
# 0        A     10
# 1        A     20
# 
# Category: B
#   category  value
# 2        B     15
# 3        B     25

Get Specific Group

Access a specific group
grouped = df.groupby('category')

# Get group 'A'
group_a = grouped.get_group('A')
#   category  value
# 0        A     10
# 1        A     20

# Get multiple groups
for cat in ['A', 'B']:
    group = grouped.get_group(cat)
    print(f"Group {cat}:")
    print(group)

Transform and Filter

transform() - Same Shape Output

Add aggregated values back to original DataFrame:

Transform adds group statistics
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 15, 25]
})

# Add group mean to each row
df['group_mean'] = df.groupby('category')['value'].transform('mean')
#   category  value  group_mean
# 0        A     10        15.0
# 1        A     20        15.0
# 2        B     15        20.0
# 3        B     25        20.0

# Add group sum
df['group_sum'] = df.groupby('category')['value'].transform('sum')

# Custom transform
df['diff_from_mean'] = df.groupby('category')['value'].transform(
    lambda x: x - x.mean()
)
#   category  value  diff_from_mean
# 0        A     10            -5.0
# 1        A     20             5.0
# 2        B     15            -5.0
# 3        B     25             5.0

filter() - Keep/Remove Groups

Filter entire groups based on group properties:

Filter groups by condition
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'C'],
    'value': [10, 20, 15, 25, 30]
})

# Keep groups where sum > 30
result = df.groupby('category').filter(lambda x: x['value'].sum() > 30)
#   category  value
# 1        A     20
# 0        A     10
# 3        B     25
# 2        B     15

# Keep groups with more than 1 member
result = df.groupby('category').filter(lambda x: len(x) > 1)

# Remove outliers within groups
def remove_outliers(group):
    return group[np.abs(group['value'] - group['value'].mean()) < 2 * group['value'].std()]

result = df.groupby('category').filter(remove_outliers)

warning

filter() returns entire groups (all rows), not individual rows. Use boolean indexing for row-level filtering.

Advanced GroupBy

apply() - Flexible Operations

Most flexible GroupBy method:

Custom operations with apply
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 15, 25]
})

# Custom function per group
def custom_function(group):
    return pd.Series({
        'sum': group['value'].sum(),
        'count': len(group),
        'max_minus_min': group['value'].max() - group['value'].min()
    })

result = df.groupby('category').apply(custom_function)
#          sum  count  max_minus_min
# category                           
# A         30      2             10
# B         40      2             10

# Return modified DataFrame
def normalize(group):
    group['normalized'] = (group['value'] - group['value'].mean()) / group['value'].std()
    return group

result = df.groupby('category').apply(normalize)

GroupBy with Custom Keys

Group by computed values
df = pd.DataFrame({
    'value': [10, 20, 30, 40, 50]
})

# Group by even/odd
df.groupby(df['value'] % 2 == 0)['value'].sum()
# value
# False    90  # Odd values: 10+30+50
# True     60  # Even values: 20+40

# Group by value ranges
df.groupby(pd.cut(df['value'], bins=[0, 25, 50]))['value'].sum()

# Group by custom function
def classify(value):
    return 'low' if value < 30 else 'high'

df.groupby(df['value'].apply(classify))['value'].mean()

Multiple Aggregations with Different Functions

Complex aggregations
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'sales': [100, 150, 200, 250],
    'quantity': [10, 15, 20, 25],
    'profit': [20, 30, 40, 50]
})

# Different functions for different columns
result = df.groupby('category').agg({
    'sales': ['sum', 'mean'],
    'quantity': 'sum',
    'profit': lambda x: x.max() - x.min()
})

# With custom names
result = df.groupby('category').agg(
    total_sales=('sales', 'sum'),
    avg_sales=('sales', 'mean'),
    total_qty=('quantity', 'sum'),
    profit_range=('profit', lambda x: x.max() - x.min())
)

Handling Missing Values

Groups with NaN

NaN in groupby keys
df = pd.DataFrame({
    'category': ['A', 'B', np.nan, 'A', np.nan],
    'value': [10, 20, 30, 40, 50]
})

# NaN creates its own group by default
df.groupby('category')['value'].sum()
# category
# A      50
# B      20
# NaN    80

# Exclude NaN groups
df.groupby('category', dropna=True)['value'].sum()
# category
# A    50
# B    20

# NaN in values (not keys) are ignored in aggregations
df['value'] = [10, np.nan, 30, 40, 50]
df.groupby('category')['value'].sum()  # Skips NaN in sum

GroupBy with Dates

Grouping by Time Periods

Group by date components
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100),
    'value': range(100)
})

# Group by month
df['month'] = df['date'].dt.month
df.groupby('month')['value'].sum()

# Group by year-month
df.groupby(df['date'].dt.to_period('M'))['value'].sum()

# Group by day of week
df.groupby(df['date'].dt.day_name())['value'].mean()

# Group by custom period
df.groupby(pd.Grouper(key='date', freq='W'))['value'].sum()

Resample vs GroupBy

Resample for time series
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'value': range(30)
})
df = df.set_index('date')

# Resample (better for time series)
df.resample('W')['value'].sum()

# GroupBy with Grouper (more flexible)
df.groupby(pd.Grouper(freq='W'))['value'].sum()

Top N per Group

Get Top N in Each Group

Top N rows per group
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, 20, 15, 25, 30, 22]
})

# Top 2 per category
result = df.groupby('category').apply(
    lambda x: x.nlargest(2, 'value')
).reset_index(drop=True)
#   category  value
# 0        A     20
# 1        A     15
# 2        B     30
# 3        B     25

# Using head() for first N
result = df.sort_values('value', ascending=False).groupby('category').head(2)

# Bottom 1 per group
result = df.groupby('category').apply(
    lambda x: x.nsmallest(1, 'value')
)

Cumulative Operations within Groups

Cumulative calculations per group
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, 20, 15, 25, 30, 22]
})

# Cumulative sum within each group
df['cumsum'] = df.groupby('category')['value'].cumsum()
#   category  value  cumsum
# 0        A     10      10
# 1        A     20      30
# 2        A     15      45
# 3        B     25      25
# 4        B     30      55
# 5        B     22      77

# Rank within group
df['rank'] = df.groupby('category')['value'].rank(ascending=False)

# Percentage of group total
df['pct_of_group'] = df.groupby('category')['value'].transform(
    lambda x: x / x.sum() * 100
)

Performance Tips

Optimize GroupBy Operations

Performance optimization
df = pd.DataFrame({
    'category': ['A'] * 10000 + ['B'] * 10000,
    'value': range(20000)
})

# Slow: Multiple separate groupby calls
sum_a = df[df['category'] == 'A']['value'].sum()
sum_b = df[df['category'] == 'B']['value'].sum()

# Fast: Single groupby
result = df.groupby('category')['value'].sum()

# Use as_index=False to avoid reset_index
result = df.groupby('category', as_index=False)['value'].sum()
#   category      value
# 0        A   49995000
# 1        B  149995000

# Faster aggregations with numba (if installed)
result = df.groupby('category')['value'].agg('sum', engine='numba')

success

Avoid multiple groupby calls. Combine all aggregations in a single agg() call for better performance.

Sort Parameter

Control sorting in groupby
df = pd.DataFrame({
    'category': ['B', 'A', 'C', 'B', 'A', 'C'],
    'value': [10, 20, 15, 25, 30, 22]
})

# GroupBy sorts by default (can be slow)
df.groupby('category')['value'].sum()
# category
# A    50
# B    35
# C    37

# Disable sorting for speed (if order doesn't matter)
df.groupby('category', sort=False)['value'].sum()
# category
# B    35  # First category in data
# A    50
# C    37

Common Patterns

Percentage of Total per Group

Calculate group percentages
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'product': ['X', 'Y', 'X', 'Y'],
    'sales': [100, 150, 200, 250]
})

# Sales as % of category total
df['pct_of_category'] = df.groupby('category')['sales'].transform(
    lambda x: x / x.sum() * 100
)
#   category product  sales  pct_of_category
# 0        A       X    100             40.0
# 1        A       Y    150             60.0
# 2        B       X    200             44.44
# 3        B       Y    250             55.56

# Sales as % of grand total
df['pct_of_total'] = df['sales'] / df['sales'].sum() * 100

Compare to Group Average

Deviation from group mean
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 15, 25]
})

# Difference from group mean
df['diff_from_mean'] = df.groupby('category')['value'].transform(
    lambda x: x - x.mean()
)

# Standardize within groups (z-score)
df['z_score'] = df.groupby('category')['value'].transform(
    lambda x: (x - x.mean()) / x.std()
)

Fill Missing with Group Mean

Impute missing values by group
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'A'],
    'value': [10, np.nan, 15, 25, 20]
})

# Fill NaN with group mean
df['value'] = df.groupby('category')['value'].transform(
    lambda x: x.fillna(x.mean())
)
#   category  value
# 0        A   10.0
# 1        A   15.0  # Filled with A's mean (10+20)/2
# 2        B   15.0
# 3        B   25.0
# 4        A   20.0

Common Mistakes

Forgetting reset_index()

GroupBy creates index from groups
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'value': [10, 20, 15, 25]
})

# Result has category as index
result = df.groupby('category')['value'].sum()
# category
# A    25
# B    45

# Need reset_index for regular DataFrame
result = df.groupby('category')['value'].sum().reset_index()
#   category  value
# 0        A     25
# 1        B     45

# Or use as_index=False
result = df.groupby('category', as_index=False)['value'].sum()

transform() vs agg() Confusion

Understanding transform vs agg
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [10, 20, 15, 25]
})

# agg() reduces to one row per group
df.groupby('category')['value'].agg('mean')
# category
# A    15.0
# B    20.0

# transform() keeps original shape
df.groupby('category')['value'].transform('mean')
# 0    15.0  # Mean of A
# 1    15.0  # Mean of A
# 2    20.0  # Mean of B
# 3    20.0  # Mean of B

Column Name Conflicts

Avoid naming conflicts
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'],
    'sum': [10, 20, 15, 25]  # 'sum' is also a method name
})

# This can be confusing
df.groupby('category')['sum'].sum()  # Works but confusing

# Better: use clear column names
df.rename(columns={'sum': 'total'}, inplace=True)
df.groupby('category')['total'].sum()

Quick Reference

Basic groupby:

df.groupby('col')['value'].sum()
df.groupby(['col1', 'col2'])['value'].mean()
df.groupby('col', as_index=False)['value'].sum()

Aggregations:

grouped.agg(['sum', 'mean', 'count'])
grouped.agg({'col1': 'sum', 'col2': 'mean'})
grouped.agg(total=('sales', 'sum'))

Transform and filter:

grouped.transform('mean')           # Add group stat to each row
grouped.filter(lambda x: len(x) > 2)  # Keep groups with >2 rows
grouped.apply(custom_func)          # Custom operation

Common patterns:

# Top N per group
df.groupby('group').head(n)
df.groupby('group').nlargest(n, 'col')

# Cumulative within group
df.groupby('group')['col'].cumsum()

# Percentage of group
df.groupby('group')['col'].transform(lambda x: x / x.sum() * 100)

# Fill NaN with group mean
df.groupby('group')['col'].transform(lambda x: x.fillna(x.mean()))

# Compare to group mean
df.groupby('group')['col'].transform(lambda x: x - x.mean())

Performance:

df.groupby('col', sort=False)       # Disable sorting
df.groupby('col').agg(['sum', 'mean'])  # Combine aggregations

Overview​

Basic GroupBy​

Single Column Grouping​

Multiple Column Grouping​

Common Aggregations​

Built-in Aggregations​

agg() - Multiple Aggregations​

GroupBy Selection​

Select Columns After Grouping​

Iterate Through Groups​

Get Specific Group​

Transform and Filter​

transform() - Same Shape Output​

filter() - Keep/Remove Groups​

Advanced GroupBy​

apply() - Flexible Operations​

GroupBy with Custom Keys​

Multiple Aggregations with Different Functions​

Handling Missing Values​

Groups with NaN​

GroupBy with Dates​

Grouping by Time Periods​

Resample vs GroupBy​

Top N per Group​

Get Top N in Each Group​

Cumulative Operations within Groups​

Performance Tips​

Optimize GroupBy Operations​

Sort Parameter​

Common Patterns​

Percentage of Total per Group​

Compare to Group Average​

Fill Missing with Group Mean​

Common Mistakes​

Forgetting reset_index()​

transform() vs agg() Confusion​

Column Name Conflicts​

Quick Reference​