Outliers and Data Validation
Overview
Outliers are data points that differ significantly from other observations. Data validation ensures your data meets expected criteria before analysis.
Detecting Outliers
Statistical Methods
Z-Score Method
Identify values that are many standard deviations from the mean:
Z-score outlier detection
import pandas as pd
import numpy as np
df = pd.DataFrame({
'value': [10, 12, 11, 13, 100, 14, 12, 11]
})
# Calculate z-scores
mean = df['value'].mean()
std = df['value'].std()
df['z_score'] = (df['value'] - mean) / std
# Outliers: |z-score| > 3 is the common threshold, but in a sample
# this small the z-score of 100 only reaches about 2.5
outliers = df[np.abs(df['z_score']) > 2]
# value z_score
# 4 100 2.47...
# Or use scipy (stats.zscore uses the population std, ddof=0,
# so its values differ slightly from the manual calculation above)
from scipy import stats
df['z_score'] = np.abs(stats.zscore(df['value']))
outliers = df[df['z_score'] > 2]
The z-score method works well for normally distributed data; for skewed data or small samples, a robust variant is sketched below.
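The modified z-score replaces the mean and standard deviation with the median and the median absolute deviation (MAD), so a single extreme value cannot inflate the spread estimate. A minimal sketch, reusing the DataFrame above (the 0.6745 constant and 3.5 cutoff follow the common Iglewicz-Hoaglin convention):
# Modified z-score: robust because the median and MAD ignore extremes
median = df['value'].median()
mad = (df['value'] - median).abs().median()  # median absolute deviation
df['mod_z'] = 0.6745 * (df['value'] - median) / mad
# 100 now scores far above the usual 3.5 cutoff and is clearly flagged
outliers = df[df['mod_z'].abs() > 3.5]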
IQR Method (Interquartile Range)
More robust to extreme values:
IQR outlier detection
df = pd.DataFrame({
'price': [100, 110, 105, 115, 1000, 108, 112, 107]
})
# Calculate IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
# Define outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Find outliers
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
# price
# 4 1000
# Filter out outliers
df_clean = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)]
The IQR method is the basis for box-plot whiskers and, unlike the z-score, remains reliable for skewed data.
Percentile Method
Simple approach using percentiles:
Percentile-based outlier detection
df = pd.DataFrame({
'value': list(range(100)) + [1000, 2000] # Two outliers
})
# Remove top/bottom 1%
lower = df['value'].quantile(0.01)
upper = df['value'].quantile(0.99)
df_clean = df[(df['value'] >= lower) & (df['value'] <= upper)]
# Or remove values beyond the 95th percentile
threshold = df['value'].quantile(0.95)
df_clean = df[df['value'] <= threshold]
Visual Detection
Visual outlier detection
import matplotlib.pyplot as plt
df = pd.DataFrame({
'value': [10, 12, 11, 13, 100, 14, 12, 11]
})
# Box plot (shows outliers as individual points)
df.boxplot(column='value')
plt.show()
# Histogram (outliers appear as isolated bars)
df['value'].hist(bins=20)
plt.show()
# Scatter plot for relationships (requires two numeric columns,
# e.g. 'col1' and 'col2' in your own data)
# df.plot.scatter(x='col1', y='col2')
Visual inspection helps identify outliers that statistics might miss.
Handling Outliers
Remove Outliers
Removing outliers
df = pd.DataFrame({
'age': [25, 30, 28, 150, 35, 29], # 150 is outlier
'salary': [50000, 60000, 55000, 58000, 1000000, 52000] # 1M is outlier
})
# Remove using IQR for each column
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return df[(df[column] >= lower) & (df[column] <= upper)]
df_clean = remove_outliers_iqr(df, 'age')
df_clean = remove_outliers_iqr(df_clean, 'salary')
Cap Outliers (Winsorization)
Replace extreme values with threshold values:
Cap outliers instead of removing
df = pd.DataFrame({
'value': [1, 2, 3, 100, 4, 5, 200]
})
# Cap at 5th and 95th percentiles
lower = df['value'].quantile(0.05)
upper = df['value'].quantile(0.95)
df['value_capped'] = df['value'].clip(lower=lower, upper=upper)
# value value_capped
# 0 1 1.3 # Capped at the 5th percentile (1.3)
# 1 2 2.0
# 2 3 3.0
# 3 100 100.0 # Not capped: the 95th percentile here is 170
# 4 4 4.0
# 5 5 5.0
# 6 200 170.0 # Capped at the 95th percentile (170)
# Or manually with nested np.where
df['value_capped'] = np.where(df['value'] > upper, upper,
                              np.where(df['value'] < lower, lower,
                                       df['value']))
Capping preserves row count while limiting outlier impact.
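If scipy is available, scipy.stats.mstats.winsorize does the same in one call; note it clamps to actual data values at the cut points rather than interpolated quantiles, so results can differ slightly from clip. A minimal sketch:
from scipy.stats.mstats import winsorize
# Clamp roughly the lowest and highest 15% of values
# (one value at each end for these 7 rows)
df['value_winsorized'] = winsorize(df['value'].to_numpy(), limits=[0.15, 0.15])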
Transform Data
Transform to reduce outlier impact
df = pd.DataFrame({
'value': [1, 10, 100, 1000, 10000]
})
# Log transformation (for right-skewed data; requires positive values)
df['log_value'] = np.log(df['value'])
# Square root transformation
df['sqrt_value'] = np.sqrt(df['value'])
# Box-Cox transformation (requires positive values)
from scipy import stats
df['boxcox_value'], _ = stats.boxcox(df['value'])
Transformations make outliers less extreme without removing data.
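np.log fails on zeros and negatives, and Box-Cox requires strictly positive input. Two common workarounds, sketched for data that includes zero:
df = pd.DataFrame({'value': [0, 1, 10, 100, 1000]})
# log1p computes log(1 + x), so zero is allowed
df['log1p_value'] = np.log1p(df['value'])
# For negative values, shift the data to be non-negative first
shift = df['value'].min()
df['log_shifted'] = np.log1p(df['value'] - shift)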
Data Validation
Range Validation
Check if values fall within expected ranges:
Validate value ranges
df = pd.DataFrame({
'age': [25, 30, 150, 28], # 150 is invalid
'score': [85, 92, 105, 78] # 105 is invalid (max 100)
})
# Check for invalid ages
invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
# age score
# 2 150 105
# Check for invalid scores
invalid_score = df[(df['score'] < 0) | (df['score'] > 100)]
# age score
# 2 150 105
# Validate and mark
df['age_valid'] = df['age'].between(0, 120)
df['score_valid'] = df['score'].between(0, 100)
# Keep only valid rows
df_valid = df[df['age_valid'] & df['score_valid']]
Type Validation
Validate data types
df = pd.DataFrame({
'id': ['1', '2', 'ABC', '4'], # ABC is invalid
'email': ['a@test.com', 'invalid', 'b@test.com', 'c@test.com']
})
# Check if id is numeric
df['id_valid'] = pd.to_numeric(df['id'], errors='coerce').notna()
# id email id_valid
# 0 1 a@test.com True
# 1 2 invalid True
# 2 ABC b@test.com False
# 3 4 c@test.com True
# Validate email format (basic)
df['email_valid'] = df['email'].str.contains('@', na=False)
# Filter valid rows
df_valid = df[df['id_valid'] & df['email_valid']]
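Whole-column dtypes can also be checked against an expected schema using pandas' type-introspection helpers. A minimal sketch (expected_types is an illustrative mapping, not part of the example above):
# Map each column to the dtype check it should pass
expected_types = {
    'id': pd.api.types.is_numeric_dtype,
    'email': pd.api.types.is_string_dtype,
}
for col, check in expected_types.items():
    if col in df.columns and not check(df[col]):
        print(f"{col}: unexpected dtype {df[col].dtype}")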
Pattern Validation
Validate with regex patterns
df = pd.DataFrame({
'phone': ['123-456-7890', '555-1234', '999-888-7777'],
'zip': ['12345', '123', '54321']
})
# Validate phone format (XXX-XXX-XXXX)
df['phone_valid'] = df['phone'].str.match(r'^\d{3}-\d{3}-\d{4}$')
# phone phone_valid
# 0 123-456-7890 True
# 1 555-1234 False
# 2 999-888-7777 True
# Validate zip code (5 digits)
df['zip_valid'] = df['zip'].str.match(r'^\d{5}$')
# zip zip_valid
# 0 12345 True
# 1 123 False
# 2 54321 True
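On pandas 1.1 or newer, str.fullmatch requires the entire string to match, so the ^ and $ anchors can be dropped. A minimal equivalent of the checks above:
df['phone_valid'] = df['phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}')
df['zip_valid'] = df['zip'].str.fullmatch(r'\d{5}')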
Uniqueness Validation
Check for required uniqueness
df = pd.DataFrame({
'user_id': [1, 2, 3, 2, 4], # 2 is duplicate
'email': ['a@test.com', 'b@test.com', 'c@test.com', 'b@test.com', 'd@test.com']
})
# Check if user_id is unique
is_unique = df['user_id'].is_unique # False
# Find duplicates
duplicates = df[df.duplicated(subset=['user_id'], keep=False)]
# user_id email
# 1 2 b@test.com
# 3 2 b@test.com
# Validate uniqueness per row
df['id_is_unique'] = ~df.duplicated(subset=['user_id'], keep=False)
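Uniqueness constraints often span several columns (a composite key); passing multiple names to subset checks the combination rather than each column alone. A minimal sketch:
# Rows where the (user_id, email) pair appears more than once
pair_dupes = df[df.duplicated(subset=['user_id', 'email'], keep=False)]
df['pair_is_unique'] = ~df.duplicated(subset=['user_id', 'email'], keep=False)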
Completeness Validation
Check for required fields
df = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', None, 'Charlie', 'David'],
'email': ['a@test.com', 'b@test.com', None, 'd@test.com']
})
# Check required fields
required_fields = ['id', 'name', 'email']
df['is_complete'] = df[required_fields].notna().all(axis=1)
# id name email is_complete
# 0 1 Alice a@test.com True
# 1 2 None b@test.com False
# 2 3 Charlie None False
# 3 4 David d@test.com True
# Get incomplete rows
incomplete = df[~df['is_complete']]
Cross-Field Validation
Validate relationships between fields
df = pd.DataFrame({
'start_date': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
'end_date': pd.to_datetime(['2024-01-15', '2024-01-20', '2024-03-10'])
})
# End date should be after start date
df['date_valid'] = df['end_date'] > df['start_date']
# start_date end_date date_valid
# 0 2024-01-01 2024-01-15 True
# 1 2024-02-01 2024-01-20 False
# 2 2024-03-01 2024-03-10 True
# Another example: discount <= price
df = pd.DataFrame({
'price': [100, 200, 150],
'discount': [10, 250, 20] # 250 is invalid
})
df['discount_valid'] = df['discount'] <= df['price']
Building Validation Functions
Create Validation Pipeline
Reusable validation function
def validate_dataframe(df):
    """Validate DataFrame and return issues."""
    issues = []
    # Check for missing values in required columns
    required = ['id', 'name', 'email']
    for col in required:
        if col in df.columns:
            missing_count = df[col].isnull().sum()
            if missing_count > 0:
                issues.append(f"{col}: {missing_count} missing values")
    # Check age range
    if 'age' in df.columns:
        invalid_age = df[(df['age'] < 0) | (df['age'] > 120)]
        if len(invalid_age) > 0:
            issues.append(f"age: {len(invalid_age)} invalid values")
    # Check for duplicates
    if 'id' in df.columns:
        dup_count = df['id'].duplicated().sum()
        if dup_count > 0:
            issues.append(f"id: {dup_count} duplicates")
    return issues
# Use it
issues = validate_dataframe(df)
if issues:
    print("Validation issues found:")
    for issue in issues:
        print(f"  - {issue}")
else:
    print("All validations passed!")
Validation Report
Generate validation report
def validation_report(df):
    """Create comprehensive validation report."""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'missing_values': df.isnull().sum().to_dict(),
        'duplicate_rows': df.duplicated().sum(),
        'dtypes': df.dtypes.astype(str).to_dict()
    }
    # Numeric columns statistics
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) > 0:
        report['numeric_summary'] = df[numeric_cols].describe().to_dict()
    return report
# Generate report
report = validation_report(df)
print(f"Total rows: {report['total_rows']}")
print(f"Duplicates: {report['duplicate_rows']}")
print(f"Missing values: {report['missing_values']}")
Common Validation Patterns
Email Validation
Validate email addresses
df = pd.DataFrame({
'email': ['user@example.com', 'invalid', 'test@domain.co.uk', '@missing.com']
})
# Basic validation
df['email_valid'] = df['email'].str.contains(
r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
na=False,
regex=True
)
# Get invalid emails
invalid_emails = df[~df['email_valid']]
Date Validation
Validate dates
df = pd.DataFrame({
'date_str': ['2024-01-01', '2024-13-01', '2024-02-30', '2024-03-15']
})
# Try to parse, invalid become NaT
df['date'] = pd.to_datetime(df['date_str'], errors='coerce')
df['date_valid'] = df['date'].notna()
# date_str date date_valid
# 0 2024-01-01 2024-01-01 True
# 1 2024-13-01 NaT False
# 2 2024-02-30 NaT False
# 3 2024-03-15 2024-03-15 True
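Passing an explicit format makes the check stricter: strings that are valid dates in a different layout (e.g. '01/15/2024') are rejected too, not just impossible dates. A sketch using the same column:
df['date_strict'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d',
                                   errors='coerce')
df['date_valid'] = df['date_strict'].notna()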
Numeric Range Validation
Validate numeric ranges
df = pd.DataFrame({
'temperature': [-300, 25, 30, 500], # -300 and 500 invalid
'humidity': [45, 60, 105, 80] # 105 invalid (0-100)
})
# Define valid ranges
ranges = {
'temperature': (-100, 100),
'humidity': (0, 100)
}
# Validate
for col, (min_val, max_val) in ranges.items():
    df[f'{col}_valid'] = df[col].between(min_val, max_val)
# Get all invalid rows
is_valid = df[[c for c in df.columns if c.endswith('_valid')]].all(axis=1)
invalid_rows = df[~is_valid]
Handling Invalid Data
Flag and Keep
Mark invalid data without removing
df = pd.DataFrame({
'value': [10, -5, 20, 1000, 30]
})
# Flag outliers
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
df['is_outlier'] = (df['value'] < Q1 - 1.5*IQR) | (df['value'] > Q3 + 1.5*IQR)
# Analyze with and without outliers
print(f"Mean with outliers: {df['value'].mean()}")
print(f"Mean without: {df[~df['is_outlier']]['value'].mean()}")
Separate Invalid Records
Split valid and invalid data
df = pd.DataFrame({
'id': [1, 2, 3, 4],
'age': [25, -5, 30, 150]
})
# Separate valid and invalid
valid_mask = df['age'].between(0, 120)
df_valid = df[valid_mask]
df_invalid = df[~valid_mask]
# Save for review
df_invalid.to_csv('invalid_records.csv', index=False)
df_valid.to_csv('valid_records.csv', index=False)
Apply Corrections
Auto-correct common issues
df = pd.DataFrame({
'email': ['USER@TEST.COM', ' user@test.com ', 'user@TEST.com']
})
# Standardize emails
df['email_clean'] = (df['email']
                     .str.strip()   # Remove surrounding whitespace
                     .str.lower())  # Lowercase
# email email_clean
# 0 USER@TEST.COM user@test.com
# 1 user@test.com user@test.com
# 2 user@TEST.com user@test.com
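Numeric fields can be corrected in the same spirit; one conservative option is to blank out impossible values rather than guess replacements. Series.where keeps values where the condition holds and inserts NaN elsewhere:
df = pd.DataFrame({'age': [25, -5, 30, 150]})
# Replace out-of-range ages with NaN for later imputation or review
df['age_clean'] = df['age'].where(df['age'].between(0, 120))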
Quick Reference
Detect outliers:
# Z-score method
z_scores = np.abs((df['col'] - df['col'].mean()) / df['col'].std())
outliers = df[z_scores > 3]
# IQR method
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
outliers = df[(df['col'] < Q1-1.5*IQR) | (df['col'] > Q3+1.5*IQR)]
# Percentile method
upper = df['col'].quantile(0.99)
outliers = df[df['col'] > upper]
Handle outliers:
df_clean = df[~outliers_mask] # Remove
df['col_capped'] = df['col'].clip(lower, upper) # Cap
df['col_log'] = np.log(df['col']) # Transform
Validate:
df['valid'] = df['col'].between(min_val, max_val) # Range
df['valid'] = df['col'].str.match(pattern) # Pattern
df['valid'] = df['col'].notna() # Not null
df['valid'] = ~df.duplicated(subset=['col']) # Unique
Common validations:
# Email
df['col'].str.contains(r'^[\w\.-]+@[\w\.-]+\.\w+$')
# Phone (US)
df['col'].str.match(r'^\d{3}-\d{3}-\d{4}$')
# Zip code (US)
df['col'].str.match(r'^\d{5}$')
# Date
pd.to_datetime(df['col'], errors='coerce').notna()