Skip to main content

Outliers and Data Validation

Overview

Outliers are data points that differ significantly from other observations. Data validation ensures your data meets expected criteria before analysis.

Detecting Outliers

Statistical Methods

Z-Score Method

Identify values that are many standard deviations from the mean:

Z-score outlier detection
import pandas as pd
import numpy as np

df = pd.DataFrame({
'value': [10, 12, 11, 13, 100, 14, 12, 11]
})

# Calculate z-scores
mean = df['value'].mean()
std = df['value'].std()
df['z_score'] = (df['value'] - mean) / std

# Outliers: |z-score| > 3 (common threshold)
outliers = df[np.abs(df['z_score']) > 3]
# value z_score
# 4 100 2.89...

# Or use scipy
from scipy import stats
df['z_score'] = np.abs(stats.zscore(df['value']))
outliers = df[df['z_score'] > 3]

Z-score works well for normally distributed data.

IQR Method (Interquartile Range)

More robust to extreme values:

IQR outlier detection
df = pd.DataFrame({
'price': [100, 110, 105, 115, 1000, 108, 112, 107]
})

# Calculate IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find outliers
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
# price
# 4 1000

# Filter out outliers
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]

IQR method is the basis for box plots and works for skewed data.

Percentile Method

Simple approach using percentiles:

Percentile-based outlier detection
df = pd.DataFrame({
'value': range(100) + [1000, 2000] # Two outliers
})

# Remove top/bottom 1%
lower = df['value'].quantile(0.01)
upper = df['value'].quantile(0.99)

df_clean = df[(df['value'] >= lower) & (df['value'] <= upper)]

# Or remove values beyond 95th percentile
threshold = df['value'].quantile(0.95)
df[df['value'] <= threshold]

Visual Detection

Visual outlier detection
import matplotlib.pyplot as plt

df = pd.DataFrame({
'value': [10, 12, 11, 13, 100, 14, 12, 11]
})

# Box plot (shows outliers as points)
df.boxplot(column='value')

# Histogram
df['value'].hist(bins=20)

# Scatter plot for relationships
df.plot.scatter(x='col1', y='col2')

Visual inspection helps identify outliers that statistics might miss.

Handling Outliers

Remove Outliers

Removing outliers
df = pd.DataFrame({
'age': [25, 30, 28, 150, 35, 29], # 150 is outlier
'salary': [50000, 60000, 55000, 58000, 1000000, 52000] # 1M is outlier
})

# Remove using IQR for each column
def remove_outliers_iqr(df, column):
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
return df[(df[column] >= lower) & (df[column] <= upper)]

df_clean = remove_outliers_iqr(df, 'age')
df_clean = remove_outliers_iqr(df_clean, 'salary')

Cap Outliers (Winsorization)

Replace extreme values with threshold values:

Cap outliers instead of removing
df = pd.DataFrame({
'value': [1, 2, 3, 100, 4, 5, 200]
})

# Cap at 5th and 95th percentiles
lower = df['value'].quantile(0.05)
upper = df['value'].quantile(0.95)

df['value_capped'] = df['value'].clip(lower=lower, upper=upper)
# value value_capped
# 0 1 1
# 1 2 2
# 2 3 3
# 3 100 5 # Capped
# 4 4 4
# 5 5 5
# 6 200 5 # Capped

# Or manually
df['value_capped'] = np.where(df['value'] > upper, upper,
np.where(df['value'] < lower, lower, df['value']))

Capping preserves row count while limiting outlier impact.

Transform Data

Transform to reduce outlier impact
df = pd.DataFrame({
'value': [1, 10, 100, 1000, 10000]
})

# Log transformation (for right-skewed data)
df['log_value'] = np.log(df['value'])

# Square root transformation
df['sqrt_value'] = np.sqrt(df['value'])

# Box-Cox transformation (requires positive values)
from scipy import stats
df['boxcox_value'], _ = stats.boxcox(df['value'])

Transformations make outliers less extreme without removing data.

Data Validation

Range Validation

Check if values fall within expected ranges:

Validate value ranges
df = pd.DataFrame({
'age': [25, 30, 150, 28], # 150 is invalid
'score': [85, 92, 105, 78] # 105 is invalid (max 100)
})

# Check for invalid ages
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
# age score
# 2 150 105

# Check for invalid scores
invalid_score = df[(df['score'] < 0) | (df['score'] > 100)]
# age score
# 2 150 105

# Validate and mark
df['age_valid'] = df['age'].between(0, 120)
df['score_valid'] = df['score'].between(0, 100)

# Keep only valid rows
df_valid = df[df['age_valid'] & df['score_valid']]

Type Validation

Validate data types
df = pd.DataFrame({
'id': ['1', '2', 'ABC', '4'], # ABC is invalid
'email': ['a@test.com', 'invalid', 'b@test.com', 'c@test.com']
})

# Check if id is numeric
df['id_valid'] = pd.to_numeric(df['id'], errors='coerce').notna()
# id email id_valid
# 0 1 a@test.com True
# 1 2 invalid True
# 2 ABC b@test.com False
# 3 4 c@test.com True

# Validate email format (basic)
df['email_valid'] = df['email'].str.contains('@', na=False)

# Filter valid rows
df_valid = df[df['id_valid'] & df['email_valid']]

Pattern Validation

Validate with regex patterns
df = pd.DataFrame({
'phone': ['123-456-7890', '555-1234', '999-888-7777'],
'zip': ['12345', '123', '54321']
})

# Validate phone format (XXX-XXX-XXXX)
df['phone_valid'] = df['phone'].str.match(r'^\d{3}-\d{3}-\d{4}$')
# phone phone_valid
# 0 123-456-7890 True
# 1 555-1234 False
# 2 999-888-7777 True

# Validate zip code (5 digits)
df['zip_valid'] = df['zip'].str.match(r'^\d{5}$')
# zip zip_valid
# 0 12345 True
# 1 123 False
# 2 54321 True

Uniqueness Validation

Check for required uniqueness
df = pd.DataFrame({
'user_id': [1, 2, 3, 2, 4], # 2 is duplicate
'email': ['a@test.com', 'b@test.com', 'c@test.com', 'b@test.com', 'd@test.com']
})

# Check if user_id is unique
is_unique = df['user_id'].is_unique # False

# Find duplicates
duplicates = df[df.duplicated(subset=['user_id'], keep=False)]
# user_id email
# 1 2 b@test.com
# 3 2 b@test.com

# Validate uniqueness per row
df['id_is_unique'] = ~df.duplicated(subset=['user_id'], keep=False)

Completeness Validation

Check for required fields
df = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', None, 'Charlie', 'David'],
'email': ['a@test.com', 'b@test.com', None, 'd@test.com']
})

# Check required fields
required_fields = ['id', 'name', 'email']
df['is_complete'] = df[required_fields].notna().all(axis=1)
# id name email is_complete
# 0 1 Alice a@test.com True
# 1 2 None b@test.com False
# 2 3 Charlie None False
# 3 4 David d@test.com True

# Get incomplete rows
incomplete = df[~df['is_complete']]

Cross-Field Validation

Validate relationships between fields
df = pd.DataFrame({
'start_date': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
'end_date': pd.to_datetime(['2024-01-15', '2024-01-20', '2024-03-10'])
})

# End date should be after start date
df['date_valid'] = df['end_date'] > df['start_date']
# start_date end_date date_valid
# 0 2024-01-01 2024-01-15 True
# 1 2024-02-01 2024-01-20 False
# 2 2024-03-01 2024-03-10 True

# Another example: discount <= price
df = pd.DataFrame({
'price': [100, 200, 150],
'discount': [10, 250, 20] # 250 is invalid
})
df['discount_valid'] = df['discount'] <= df['price']

Building Validation Functions

Create Validation Pipeline

Reusable validation function
def validate_dataframe(df):
"""Validate DataFrame and return issues."""
issues = []

# Check for missing values in required columns
required = ['id', 'name', 'email']
for col in required:
if col in df.columns:
missing_count = df[col].isnull().sum()
if missing_count > 0:
issues.append(f"{col}: {missing_count} missing values")

# Check age range
if 'age' in df.columns:
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
if len(invalid_age) > 0:
issues.append(f"age: {len(invalid_age)} invalid values")

# Check for duplicates
if 'id' in df.columns:
dup_count = df['id'].duplicated().sum()
if dup_count > 0:
issues.append(f"id: {dup_count} duplicates")

return issues

# Use it
issues = validate_dataframe(df)
if issues:
print("Validation issues found:")
for issue in issues:
print(f" - {issue}")
else:
print("All validations passed!")

Validation Report

Generate validation report
def validation_report(df):
"""Create comprehensive validation report."""
report = {
'total_rows': len(df),
'total_columns': len(df.columns),
'missing_values': df.isnull().sum().to_dict(),
'duplicate_rows': df.duplicated().sum(),
'dtypes': df.dtypes.astype(str).to_dict()
}

# Numeric columns statistics
numeric_cols = df.select_dtypes(include=['number']).columns
if len(numeric_cols) > 0:
report['numeric_summary'] = df[numeric_cols].describe().to_dict()

return report

# Generate report
report = validation_report(df)
print(f"Total rows: {report['total_rows']}")
print(f"Duplicates: {report['duplicate_rows']}")
print(f"Missing values: {report['missing_values']}")

Common Validation Patterns

Email Validation

Validate email addresses
df = pd.DataFrame({
'email': ['user@example.com', 'invalid', 'test@domain.co.uk', '@missing.com']
})

# Basic validation
df['email_valid'] = df['email'].str.contains(
r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
na=False,
regex=True
)

# Get invalid emails
invalid_emails = df[~df['email_valid']]

Date Validation

Validate dates
df = pd.DataFrame({
'date_str': ['2024-01-01', '2024-13-01', '2024-02-30', '2024-03-15']
})

# Try to parse, invalid become NaT
df['date'] = pd.to_datetime(df['date_str'], errors='coerce')
df['date_valid'] = df['date'].notna()
# date_str date date_valid
# 0 2024-01-01 2024-01-01 True
# 1 2024-13-01 NaT False
# 2 2024-02-30 NaT False
# 3 2024-03-15 2024-03-15 True

Numeric Range Validation

Validate numeric ranges
df = pd.DataFrame({
'temperature': [-300, 25, 30, 500], # -300 and 500 invalid
'humidity': [45, 60, 105, 80] # 105 invalid (0-100)
})

# Define valid ranges
ranges = {
'temperature': (-100, 100),
'humidity': (0, 100)
}

# Validate
for col, (min_val, max_val) in ranges.items():
df[f'{col}_valid'] = df[col].between(min_val, max_val)

# Get all invalid rows
is_valid = df[[c for c in df.columns if c.endswith('_valid')]].all(axis=1)
invalid_rows = df[~is_valid]

Handling Invalid Data

Flag and Keep

Mark invalid data without removing
df = pd.DataFrame({
'value': [10, -5, 20, 1000, 30]
})

# Flag outliers
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
df['is_outlier'] = (df['value'] < Q1 - 1.5*IQR) | (df['value'] > Q3 + 1.5*IQR)

# Analyze with and without outliers
print(f"Mean with outliers: {df['value'].mean()}")
print(f"Mean without: {df[~df['is_outlier']]['value'].mean()}")

Separate Invalid Records

Split valid and invalid data
df = pd.DataFrame({
'id': [1, 2, 3, 4],
'age': [25, -5, 30, 150]
})

# Separate valid and invalid
valid_mask = df['age'].between(0, 120)
df_valid = df[valid_mask]
df_invalid = df[~valid_mask]

# Save for review
df_invalid.to_csv('invalid_records.csv', index=False)
df_valid.to_csv('valid_records.csv', index=False)

Apply Corrections

Auto-correct common issues
df = pd.DataFrame({
'email': ['USER@TEST.COM', ' user@test.com ', 'user@TEST.com']
})

# Standardize emails
df['email_clean'] = (df['email']
.str.strip() # Remove whitespace
.str.lower() # Lowercase
)
# email email_clean
# 0 USER@TEST.COM user@test.com
# 1 user@test.com user@test.com
# 2 user@TEST.com user@test.com

Quick Reference

Detect outliers:

# Z-score method
z_scores = np.abs((df['col'] - df['col'].mean()) / df['col'].std())
outliers = df[z_scores > 3]

# IQR method
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
outliers = df[(df['col'] < Q1-1.5*IQR) | (df['col'] > Q3+1.5*IQR)]

# Percentile method
upper = df['col'].quantile(0.99)
outliers = df[df['col'] > upper]

Handle outliers:

df_clean = df[~outliers_mask]           # Remove
df['col_capped'] = df['col'].clip(lower, upper) # Cap
df['col_log'] = np.log(df['col']) # Transform

Validate:

df['valid'] = df['col'].between(min, max)  # Range
df['valid'] = df['col'].str.match(pattern) # Pattern
df['valid'] = df['col'].notna() # Not null
df['valid'] = ~df.duplicated(subset=['col']) # Unique

Common validations:

# Email
df['col'].str.contains(r'^[\w\.-]+@[\w\.-]+\.\w+$')

# Phone (US)
df['col'].str.match(r'^\d{3}-\d{3}-\d{4}$')

# Zip code (US)
df['col'].str.match(r'^\d{5}$')

# Date
pd.to_datetime(df['col'], errors='coerce').notna()