Interview Snippets

Overview

Common pandas questions asked in data science and analytics interviews. Each snippet includes the problem, solution, and explanation.

Data Manipulation

Find Duplicate Rows

Question: Find all duplicate rows in a DataFrame.

Find duplicates
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'age': [25, 30, 25, 35]
})

# Find duplicates
duplicates = df[df.duplicated(keep=False)]
#     name  age
# 0  Alice   25
# 2  Alice   25

# Find first duplicate only
duplicates = df[df.duplicated(keep='first')]
#     name  age
# 2  Alice   25

# Count duplicates per combination
dup_counts = df.groupby(df.columns.tolist()).size().reset_index(name='count')
dup_counts = dup_counts[dup_counts['count'] > 1]

Top N per Group

Question: Get top 3 highest values per category.

Top N per group
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
    'value': [10, 20, 15, 25, 30, 22, 18, 28]
})

# Method 1: Using groupby + nlargest
top3 = df.groupby('category').apply(
    lambda x: x.nlargest(3, 'value')
).reset_index(drop=True)

# Method 2: Using sort + groupby + head
top3 = (df.sort_values('value', ascending=False)
          .groupby('category')
          .head(3))

# Method 3: Using rank
df['rank'] = df.groupby('category')['value'].rank(ascending=False)
top3 = df[df['rank'] <= 3]

Second Highest Value

Question: Find the second highest salary per department.

Second highest value
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'Sales'],
    'salary': [50000, 60000, 70000, 65000, 55000]
})

# Method 1: Using nlargest
second_highest = df.groupby('department')['salary'].nlargest(2).groupby('department').last()

# Method 2: Using rank
df['rank'] = df.groupby('department')['salary'].rank(ascending=False, method='dense')
second_highest = df[df['rank'] == 2]

# Method 3: Using sort
second_highest = (df.sort_values('salary', ascending=False)
                    .groupby('department')
                    .nth(1))

Pivot with Aggregation

Question: Create a pivot table showing average sales by product and region.

Pivot table with aggregation
df = pd.DataFrame({
    'product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'region': ['East', 'East', 'West', 'West', 'East', 'West'],
    'sales': [100, 150, 120, 180, 110, 170]
})

# Pivot table
result = df.pivot_table(
    index='product',
    columns='region',
    values='sales',
    aggfunc='mean'
)
# region   East  West
# product            
# A       105.0   120
# B       150.0   175

# With multiple aggregations
result = df.pivot_table(
    index='product',
    columns='region',
    values='sales',
    aggfunc=['mean', 'sum', 'count']
)

String Operations

Extract from String

Question: Extract domain from email addresses.

Extract domain from email
df = pd.DataFrame({
    'email': ['alice@gmail.com', 'bob@yahoo.com', 'charlie@gmail.com']
})

# Extract domain
df['domain'] = df['email'].str.split('@').str[1]
# Or using extract
df['domain'] = df['email'].str.extract(r'@(.+)')

#              email      domain
# 0  alice@gmail.com   gmail.com
# 1   bob@yahoo.com   yahoo.com
# 2  charlie@gmail.com  gmail.com

Clean Phone Numbers

Question: Standardize phone numbers to XXX-XXX-XXXX format.

Standardize phone numbers
df = pd.DataFrame({
    'phone': ['(123) 456-7890', '555.123.4567', '9998887777']
})

# Remove all non-digits
df['cleaned'] = df['phone'].str.replace(r'\D', '', regex=True)

# Format as XXX-XXX-XXXX
df['formatted'] = df['cleaned'].str.replace(
    r'(\d{3})(\d{3})(\d{4})',
    r'\1-\2-\3',
    regex=True
)
#             phone     cleaned     formatted
# 0  (123) 456-7890  1234567890  123-456-7890
# 1   555.123.4567  5551234567  555-123-4567
# 2      9998887777  9998887777  999-888-7777

Missing Data

Fill Missing with Group Mean

Question: Fill missing values with the mean of their group.

Fill NaN with group mean
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'A'],
    'value': [10, np.nan, 15, 25, 20]
})

# Fill with group mean
df['value'] = df.groupby('category')['value'].transform(
    lambda x: x.fillna(x.mean())
)
#   category  value
# 0        A   10.0
# 1        A   15.0  # Filled with mean of A: (10+20)/2
# 2        B   15.0
# 3        B   25.0
# 4        A   20.0

Forward Fill by Group

Question: Forward fill missing values within each group.

Forward fill within groups
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'value': [10, np.nan, np.nan, 20, np.nan, 30]
})

# Forward fill per group
df['filled'] = df.groupby('id')['value'].ffill()
#    id  value  filled
# 0   1   10.0    10.0
# 1   1    NaN    10.0
# 2   1    NaN    10.0
# 3   2   20.0    20.0
# 4   2    NaN    20.0
# 5   2   30.0    30.0

DateTime Operations

Calculate Age from Birthdate

Question: Calculate age from birthdate.

Calculate age
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'birthdate': pd.to_datetime(['1990-05-15', '1985-08-20', '1995-12-10'])
})

today = pd.Timestamp('2024-01-15')
df['age'] = ((today - df['birthdate']).dt.days / 365.25).astype(int)
#       name  birthdate  age
# 0    Alice 1990-05-15   33
# 1      Bob 1985-08-20   38
# 2  Charlie 1995-12-10   28

Business Days Between Dates

Question: Calculate business days between two dates.

Count business days
df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', '2024-02-01']),
    'end': pd.to_datetime(['2024-01-15', '2024-02-15'])
})

# Count business days
df['business_days'] = df.apply(
    lambda row: len(pd.bdate_range(row['start'], row['end'])),
    axis=1
)
#        start        end  business_days
# 0 2024-01-01 2024-01-15             11
# 1 2024-02-01 2024-02-15             11

Filter Last N Days

Question: Get records from the last 30 days.

Filter recent records
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100),
    'value': range(100)
})

# Last 30 days from most recent date
end_date = df['date'].max()
start_date = end_date - pd.Timedelta(days=30)
recent = df[df['date'] >= start_date]

# Alternative
recent = df[df['date'] >= df['date'].max() - pd.Timedelta(days=30)]

Aggregation & GroupBy

Running Total by Group

Question: Calculate cumulative sum per category.

Cumulative sum per group
df = pd.DataFrame({
    'category': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [10, 20, 15, 25, 30, 22]
})

df['cumsum'] = df.groupby('category')['value'].cumsum()
#   category  value  cumsum
# 0        A     10      10
# 1        A     20      30
# 2        A     15      45
# 3        B     25      25
# 4        B     30      55
# 5        B     22      77

Percentage of Total per Group

Question: Calculate each value as percentage of group total.

Percentage within groups
df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B'],
    'value': [100, 150, 200, 250]
})

df['pct_of_group'] = df.groupby('category')['value'].transform(
    lambda x: x / x.sum() * 100
)
#   category  value  pct_of_group
# 0        A    100          40.0
# 1        A    150          60.0
# 2        B    200          44.44
# 3        B    250          55.56

Rank Within Groups

Question: Rank values within each group.

Rank within groups
df = pd.DataFrame({
    'department': ['Sales', 'Sales', 'IT', 'IT', 'Sales'],
    'salary': [50000, 60000, 70000, 65000, 55000]
})

df['rank'] = df.groupby('department')['salary'].rank(ascending=False)
#   department  salary  rank
# 0      Sales   50000   3.0
# 1      Sales   60000   1.0
# 2         IT   70000   1.0
# 3         IT   65000   2.0
# 4      Sales   55000   2.0

Merging & Joining

Find Unmatched Records

Question: Find records in left DataFrame that don't have a match in right.

Find unmatched records
left = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

right = pd.DataFrame({
    'id': [1, 2, 5],
    'value': [100, 200, 300]
})

# Find unmatched
merged = pd.merge(left, right, on='id', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
#    id     name  value      _merge
# 2   3  Charlie    NaN   left_only
# 3   4    David    NaN   left_only

Merge Multiple DataFrames

Question: Merge three DataFrames on a common key.

Merge multiple DataFrames
df1 = pd.DataFrame({'id': [1, 2], 'a': [10, 20]})
df2 = pd.DataFrame({'id': [1, 2], 'b': [30, 40]})
df3 = pd.DataFrame({'id': [1, 2], 'c': [50, 60]})

# Method 1: Chain merges
result = df1.merge(df2, on='id').merge(df3, on='id')

# Method 2: Using reduce
from functools import reduce
dfs = [df1, df2, df3]
result = reduce(lambda left, right: pd.merge(left, right, on='id'), dfs)
#    id   a   b   c
# 0   1  10  30  50
# 1   2  20  40  60

Performance & Optimization

Optimize Data Types

Question: Reduce memory usage of a DataFrame.

Optimize memory usage
df = pd.DataFrame({
    'category': ['A', 'B', 'A', 'B'] * 1000,
    'value': range(4000)
})

# Check memory
print(df.memory_usage(deep=True))

# Optimize
df['category'] = df['category'].astype('category')
df['value'] = pd.to_numeric(df['value'], downcast='integer')

# Memory reduced significantly
print(df.memory_usage(deep=True))

Vectorized vs Loop

Question: Calculate a new column efficiently.

Vectorization vs loops
df = pd.DataFrame({
    'a': range(10000),
    'b': range(10000)
})

# Bad: Loop (slow)
result = []
for i in range(len(df)):
    result.append(df.loc[i, 'a'] + df.loc[i, 'b'])
df['sum'] = result

# Good: Vectorized (fast)
df['sum'] = df['a'] + df['b']

# Bad: Apply with simple operation
df['sum'] = df.apply(lambda row: row['a'] + row['b'], axis=1)

# Good: Vectorized
df['sum'] = df['a'] + df['b']

Complex Queries

Consecutive Days

Question: Find records with 3 or more consecutive days of data.

Find consecutive sequences
df = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', 
                            '2024-01-05', '2024-01-06', '2024-01-10'])
})

# Calculate day differences
df['day_diff'] = df['date'].diff().dt.days

# Mark new sequences
df['new_seq'] = (df['day_diff'] != 1) | (df['day_diff'].isna())
df['seq_id'] = df['new_seq'].cumsum()

# Count sequence lengths
seq_lengths = df.groupby('seq_id').size()

# Filter sequences with 3+ days
long_sequences = seq_lengths[seq_lengths >= 3]
result = df[df['seq_id'].isin(long_sequences.index)]

Remove Outliers by Group

Question: Remove outliers using IQR method per category.

Remove outliers per group
df = pd.DataFrame({
    'category': ['A'] * 10 + ['B'] * 10,
    'value': [1, 2, 3, 4, 5, 100, 7, 8, 9, 10,  # 100 is outlier in A
              20, 22, 24, 26, 28, 30, 32, 34, 200, 38]  # 200 is outlier in B
})

def remove_outliers(group):
    Q1 = group['value'].quantile(0.25)
    Q3 = group['value'].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return group[(group['value'] >= lower) & (group['value'] <= upper)]

cleaned = df.groupby('category').apply(remove_outliers).reset_index(drop=True)

Transpose with Aggregation

Question: Reshape data from long to wide with multiple value columns.

Complex pivot
df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'metric': ['sales', 'profit', 'sales', 'profit'],
    'value': [100, 20, 120, 25]
})

# Pivot to wide format
wide = df.pivot(index='date', columns='metric', values='value')
# metric      profit  sales
# date                     
# 2024-01-01      20    100
# 2024-01-02      25    120

# Calculate margin
wide['margin'] = (wide['profit'] / wide['sales'] * 100).round(2)

SQL-Like Operations

Window Functions

Question: Calculate difference from previous row per group.

Lag/Lead calculations
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'date': pd.date_range('2024-01-01', periods=6),
    'value': [10, 15, 20, 30, 35, 40]
})

# Lag (previous value)
df['prev_value'] = df.groupby('id')['value'].shift(1)

# Calculate change
df['change'] = df['value'] - df['prev_value']

# Percentage change
df['pct_change'] = df.groupby('id')['value'].pct_change() * 100

Self Join

Question: Find pairs of records that meet a condition.

Self join pattern
df = pd.DataFrame({
    'employee': ['Alice', 'Bob', 'Charlie'],
    'manager': ['Bob', 'Charlie', None]
})

# Find employee-manager pairs
result = pd.merge(
    df,
    df,
    left_on='manager',
    right_on='employee',
    suffixes=('_emp', '_mgr')
)
#   employee_emp manager_emp employee_mgr manager_mgr
# 0        Alice         Bob          Bob     Charlie
# 1          Bob     Charlie      Charlie        None

Quick Tips

Common mistakes to avoid:

# Don't: Chained indexing
df[df['col'] > 5]['new_col'] = 10  # Warning!

# Do: Use loc
df.loc[df['col'] > 5, 'new_col'] = 10

# Don't: Loop through rows
for i in range(len(df)):
    df.loc[i, 'new'] = df.loc[i, 'a'] + df.loc[i, 'b']

# Do: Vectorize
df['new'] = df['a'] + df['b']

# Don't: Repeated concatenation
result = pd.DataFrame()
for chunk in chunks:
    result = pd.concat([result, chunk])

# Do: Collect then concat once
results = []
for chunk in chunks:
    results.append(chunk)
result = pd.concat(results)

One-liners:

# Top 3 per group
df.sort_values('val', ascending=False).groupby('cat').head(3)

# Remove duplicates keeping last
df.drop_duplicates(subset=['id'], keep='last')

# Fill forward then backward
df['col'].ffill().bfill()

# Conditional column
df['label'] = np.where(df['val'] > 100, 'high', 'low')

# Multiple conditions
df['label'] = np.select(
    [df['val'] < 50, df['val'] < 100],
    ['low', 'medium'],
    default='high'
)

# Percentage change
df['pct_change'] = df['value'].pct_change() * 100

# Rank with ties
df['rank'] = df['score'].rank(method='dense', ascending=False)

# Group and get first/last
df.groupby('id').first()
df.groupby('id').last()

Interview preparation checklist:

Know difference between loc and iloc
Understand merge types (inner, left, right, outer)
Master groupby operations
Handle missing data appropriately
Use vectorized operations over loops
Know when to use apply vs vectorization
Understand datetime operations
Practice reshaping (pivot, melt)
Optimize data types for memory
Write readable, efficient code

Overview​

Data Manipulation​

Find Duplicate Rows​

Top N per Group​

Second Highest Value​

Pivot with Aggregation​

String Operations​

Extract from String​

Clean Phone Numbers​

Missing Data​

Fill Missing with Group Mean​

Forward Fill by Group​

DateTime Operations​

Calculate Age from Birthdate​

Business Days Between Dates​

Filter Last N Days​

Aggregation & GroupBy​

Running Total by Group​

Percentage of Total per Group​

Rank Within Groups​

Merging & Joining​

Find Unmatched Records​

Merge Multiple DataFrames​

Performance & Optimization​

Optimize Data Types​

Vectorized vs Loop​

Complex Queries​

Consecutive Days​

Remove Outliers by Group​

Transpose with Aggregation​

SQL-Like Operations​

Window Functions​

Self Join​

Quick Tips​