
Pandas Overview

What is Pandas?

pandas is a widely used open-source Python library for data manipulation and analysis. Its two core data structures, the DataFrame (a 2D labeled table) and the Series (a 1D labeled array), make working with structured data intuitive and powerful.

When to Use Pandas

Use pandas for:

  • Exploratory data analysis
  • Data cleaning and preprocessing
  • Datasets that fit comfortably in memory (roughly under ~10GB, depending on available RAM)
  • Time series analysis
  • Structured/tabular data

Consider alternatives when:

  • The data exceeds RAM (use Dask, Polars, or PySpark)
  • You need maximum performance (try Polars)
  • You are working with streaming data

Quick Example

Example
import pandas as pd

# Read and explore
df = pd.read_csv('sales.csv')
print(df.head())
df.info()  # info() prints its report itself and returns None

# Clean and transform
df_clean = (df
    .dropna(subset=['revenue'])
    .query('revenue > 0')
    .assign(profit_margin=lambda x: x['profit'] / x['revenue'])
)

# Aggregate
summary = (df_clean
    .groupby('category')['revenue']
    .agg(['sum', 'mean', 'count'])
    .sort_values('sum', ascending=False)
)

# Export (writing .xlsx requires an Excel engine such as openpyxl)
summary.to_excel('summary.xlsx')

Key Concepts

Data Structures

  • Series: 1D labeled array
  • DataFrame: 2D table
  • Index: the labels for rows (df.index) and columns (df.columns)
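
As a minimal sketch of how these three structures relate, the snippet below builds a Series and a DataFrame by hand. The fruit names, prices, and stock counts are made-up illustration data, not part of any real dataset.

```python
import pandas as pd

# A Series: one-dimensional values with a label (index entry) per element.
prices = pd.Series(
    [1.50, 0.25, 3.00],
    index=["apple", "banana", "cherry"],
    name="price",
)

# A DataFrame: a 2D table whose columns share one row index.
inventory = pd.DataFrame(
    {"price": [1.50, 0.25, 3.00], "stock": [10, 150, 0]},
    index=["apple", "banana", "cherry"],
)

# Row labels live in .index and column labels in .columns.
row_labels = list(inventory.index)
column_labels = list(inventory.columns)

# Selecting a single column returns a Series on the same row index.
stock = inventory["stock"]
print(type(stock).__name__, row_labels, column_labels)
```

Because each column is a Series on a shared index, operations between columns align by label rather than by position.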

Selection

  • Column: df['col'] or df[['col1', 'col2']]
  • Position: df.iloc[0:5, 0:3]
  • Label: df.loc[rows, cols]
  • Filter: df[df['col'] > 5]
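
The selection styles above can be compared side by side on a small frame. This is an illustrative sketch: the column names mirror the placeholders used in the list, and the values and row labels are invented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col": [3, 8, 5, 12],
        "col1": ["a", "b", "c", "d"],
        "col2": [1.0, 2.0, 3.0, 4.0],
    },
    index=["w", "x", "y", "z"],
)

# Column selection: a single name gives a Series, a list gives a DataFrame.
one_column = df["col"]
two_columns = df[["col1", "col2"]]

# Position-based: first two rows and first two columns (end is excluded).
by_position = df.iloc[0:2, 0:2]

# Label-based: unlike iloc, a label slice includes its end label.
by_label = df.loc["w":"y", ["col", "col2"]]

# Boolean filter: keep only rows where the condition is True.
filtered = df[df["col"] > 5]

print(by_position.shape, by_label.shape, list(filtered.index))
```

The main thing to watch is the slicing difference: iloc follows Python's half-open convention, while loc includes both endpoints.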

Cleaning

  • Missing: dropna(), fillna()
  • Types: astype()
  • Strings: .str accessor
  • Duplicates: drop_duplicates()
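
One possible way to chain these cleaning methods is sketched below. The raw table is invented for illustration, and filling missing scores with the column mean is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical messy input: stray whitespace, numbers stored as text,
# a duplicated row, and missing values.
raw = pd.DataFrame(
    {
        "name": ["  Alice ", "bob", "bob", "Cara"],
        "age": ["34", "27", "27", None],
        "score": [88.0, np.nan, np.nan, 92.5],
    }
)

cleaned = (
    raw
    # Strings: trim whitespace and normalize capitalization.
    .assign(name=lambda d: d["name"].str.strip().str.title())
    # Duplicates: drop rows that repeat an earlier row exactly.
    .drop_duplicates()
    # Missing: drop rows with no age at all...
    .dropna(subset=["age"])
    # ...and fill remaining missing scores with the column mean.
    .assign(score=lambda d: d["score"].fillna(d["score"].mean()))
    # Types: convert the text ages to integers.
    .astype({"age": "int64"})
    .reset_index(drop=True)
)

print(cleaned)
```

The order matters here: astype to an integer type would fail while the age column still contains a missing value, so dropna runs first.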

Aggregation

  • Group: groupby()
  • Combine: merge(), join(), concat()
  • Reshape: pivot_table(), melt()
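
The sketch below shows how grouping, combining, and reshaping might fit together. The orders and customers tables, and all of their column names, are hypothetical.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3],
        "amount": [20.0, 35.0, 15.0, 50.0],
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
    }
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "region": ["north", "south", "north"]}
)

# Group: total amount per customer.
totals = orders.groupby("customer_id")["amount"].sum()

# Combine: merge() matches rows on a key column, like a SQL join.
enriched = orders.merge(customers, on="customer_id", how="left")

# join() does the same kind of lookup against the other frame's index.
joined = orders.join(customers.set_index("customer_id"), on="customer_id")

# concat() stacks frames that share the same columns.
stacked = pd.concat([orders, orders.head(1)], ignore_index=True)

# Reshape: pivot_table() goes long to wide, melt() goes wide to long.
wide = enriched.pivot_table(
    index="region", columns="quarter", values="amount",
    aggfunc="sum", fill_value=0,
)
long = wide.reset_index().melt(
    id_vars="region", var_name="quarter", value_name="amount"
)

print(wide)
```

As a rough guide, merge() is the general-purpose choice for key-based combination, while concat() is for stacking frames along an axis.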

Time Series

  • DateTime operations: pd.to_datetime(), .dt accessor
  • Resampling (daily → monthly): resample()
  • Rolling windows: rolling()
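
A small sketch of these time series features, using 60 days of synthetic daily sales (the values are simply 1 through 60). The "MS" (month start) alias and the 7-day window are illustrative choices, not recommendations.

```python
import pandas as pd

# Synthetic daily data: 2024-01-01 through 2024-02-29.
days = pd.date_range("2024-01-01", periods=60, freq="D")
daily = pd.DataFrame({"date": days, "sales": range(1, 61)})

# DateTime operations: the .dt accessor exposes date components.
daily["weekday"] = daily["date"].dt.day_name()

# Resampling and rolling work on a datetime index.
sales = daily.set_index("date")["sales"]

# Resample daily -> monthly totals, labeled by month start.
monthly = sales.resample("MS").sum()

# Rolling window: 7-day moving average (the first 6 values are NaN).
weekly_avg = sales.rolling(window=7).mean()

print(monthly)
```

Resampling changes the frequency of the data, while a rolling window keeps the original frequency and smooths over neighboring rows.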

Typical Workflow

Import → Explore → Clean → Transform → Aggregate → Export

Common Operations

Task            Command
Read CSV        pd.read_csv('file.csv')
First 5 rows    df.head()
Data info       df.info()
Statistics      df.describe()
Select column   df['col'] or df.col
Filter rows     df[df['col'] > 5]
Group by        df.groupby('col').sum()
Missing values  df.isnull().sum()
Sort            df.sort_values('col')
Save CSV        df.to_csv('file.csv')
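
To try several of these operations without a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV content and the "other" column are made up for illustration, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; the third row has a missing "col" value.
csv_text = "col,other\n7,x\n3,y\n,z\n9,x\n"
df = pd.read_csv(io.StringIO(csv_text))

# Missing values: count of NaN per column.
missing_per_column = df.isnull().sum()

# Sort: ascending by default, with missing values placed last.
sorted_df = df.sort_values("col")

# Group by: sum() skips NaN, so an all-missing group sums to 0.
group_sums = df.groupby("other")["col"].sum()

# Save CSV: index=False leaves the row index out of the output.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(missing_per_column)
```

Passing index=False to to_csv() is a common choice when the index is just the default 0..n-1 range, since otherwise it is written as an extra unnamed column.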

Installation

Install the pandas package with pip:

Terminal
pip install pandas

Verify the installation:

Python
import pandas as pd

print(pd.__version__)

Resources