Core Concepts
Overview
pandas has three fundamental data structures:
- Series: 1D labeled array
- DataFrame: 2D labeled table
- Index: Labels for rows and columns
Series
A Series is a 1D array with labels (index).
Creating Series
import pandas as pd
# From list
s = pd.Series([10, 20, 30, 40])
# 0 10
# 1 20
# 2 30
# 3 40
# With custom index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# a 10
# b 20
# c 30
# From dictionary
s = pd.Series({'a': 10, 'b': 20, 'c': 30})
# From scalar
s = pd.Series(5, index=['a', 'b', 'c'])
# a 5
# b 5
# c 5
Accessing Series Data
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# By label
s['a'] # 10
s[['a', 'c']] # Multiple values
# By position
s.iloc[0] # 10
s.iloc[0:2] # First two
# Boolean indexing
s[s > 15] # Values greater than 15
Series Attributes
s.values # NumPy array of values
s.index # Index object
s.dtype # Data type
s.shape # (3,)
s.size # 3
s.name # Series name (None by default)
Common Series Operations
s = pd.Series([10, 20, 30])
# Arithmetic
s + 5 # Add 5 to all
s * 2 # Multiply all by 2
s1 + s2 # Element-wise addition
# Statistics
s.mean()
s.sum()
s.std()
s.min()
s.max()
# Sorting
s.sort_values() # By values
s.sort_index() # By index
# Unique and counts
s.unique()
s.value_counts()
DataFrame
A DataFrame is a 2D table with labeled rows and columns. Think of it as a dictionary of Series sharing the same index.
Creating DataFrames
# From dictionary
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['NYC', 'LA', 'Chicago']
})
# From list of lists
df = pd.DataFrame(
[['Alice', 25], ['Bob', 30]],
columns=['name', 'age']
)
# From list of dictionaries
df = pd.DataFrame([
{'name': 'Alice', 'age': 25},
{'name': 'Bob', 'age': 30}
])
# From NumPy array
import numpy as np
df = pd.DataFrame(
np.random.randn(3, 2),
columns=['A', 'B']
)
DataFrame Structure
df = pd.DataFrame({
'name': ['Alice', 'Bob'],
'age': [25, 30],
'city': ['NYC', 'LA']
})
# name age city
# 0 Alice 25 NYC
# 1 Bob 30 LA
Each column is a Series:
df['age'] # Series
type(df['age']) # pandas.core.series.Series
DataFrame Attributes
df.shape # (rows, columns) e.g., (2, 3)
df.size # Total elements: 6
df.columns # Column names
df.index # Row labels
df.dtypes # Data type per column
df.info() # Summary: types, non-null counts, memory
df.describe() # Statistics for numeric columns
Accessing DataFrame Data
# Columns
df['age'] # Single column (Series)
df[['name', 'age']] # Multiple columns (DataFrame)
# Rows by position
df.iloc[0] # First row (Series)
df.iloc[0:2] # First two rows (DataFrame)
# Rows by label
df.loc[0] # Row with label 0
df.loc[0:1] # Rows 0 to 1 (inclusive)
# Specific cells
df.loc[0, 'name'] # Row 0, column 'name'
df.iloc[0, 1] # Row 0, column 1
Adding and Removing Columns
# Add column
df['country'] = 'USA' # Scalar
df['age_doubled'] = df['age'] * 2 # From calculation
# Remove column
df.drop('age_doubled', axis=1) # Returns new DataFrame
df.drop('age_doubled', axis=1, inplace=True) # Modifies in place
del df['country'] # In-place deletion
Common DataFrame Operations
# Sorting
df.sort_values('age') # By column
df.sort_values('age', ascending=False) # Descending
df.sort_index() # By index
# Filtering
df[df['age'] > 25] # Boolean condition
df.query('age > 25') # Query syntax
# Statistics
df.mean() # Mean of numeric columns
df['age'].sum() # Sum of one column
df.corr() # Correlation matrix
# Info
df.head() # First 5 rows
df.tail(3) # Last 3 rows
df.sample(2) # Random 2 rows
Index
The Index is the label for rows (and columns). It enables fast lookups and alignment.
Index Basics
# Default numeric index
df = pd.DataFrame({'A': [1, 2, 3]})
# Index: RangeIndex(start=0, stop=3, step=1)
# Custom index
df = pd.DataFrame(
{'A': [1, 2, 3]},
index=['a', 'b', 'c']
)
# Index: Index(['a', 'b', 'c'], dtype='object')
Setting and Resetting Index
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35]
})
# Set column as index
df.set_index('name')
# age
# name
# Alice 25
# Bob 30
# Charlie 35
# Reset index (move index back to column)
df_indexed = df.set_index('name')
df_indexed.reset_index()
# name age
# 0 Alice 25
# 1 Bob 30
# 2 Charlie 35
# Drop index when resetting
df_indexed.reset_index(drop=True)
Index Properties
df.index.name # Name of index
df.index.names # Names (for MultiIndex)
df.index.is_unique # Check if unique
df.index.dtype # Data type
Why Index Matters
Fast lookups
df.set_index('name').loc['Alice'] # Fast label-based access
Automatic alignment
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['b', 'c'])
s1 + s2
# a NaN
# b 5.0
# c NaN
Time series indexing
df.index = pd.to_datetime(df['date'])
df['2024'] # All rows in 2024
df['2024-01'] # All rows in January 2024
Relationship Between Series and DataFrame
A DataFrame is a collection of Series with a shared index:
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Each column is a Series
df['A'] # Series
df['B'] # Series
# Each row is also a Series
df.iloc[0] # Series with index ['A', 'B']
Converting between them:
# DataFrame to Series
s = df['A'] # Single column
s = df.iloc[0] # Single row
# Series to DataFrame
df_new = s.to_frame() # Single column
df_new = s.to_frame(name='column_name')
Common Gotchas
Views vs Copies
# This might be a view or a copy
df_subset = df[df['age'] > 25]
df_subset['age'] = 100 # May or may not modify df
# Explicit copy
df_subset = df[df['age'] > 25].copy()
df_subset['age'] = 100 # Won't affect df
Chained Indexing
# Bad: chained indexing
df[df['age'] > 25]['age'] = 100 # Warning!
# Good: use loc
df.loc[df['age'] > 25, 'age'] = 100
Column Names
# Avoid spaces in column names
df.columns = ['first name', 'age'] # Can't use df.first name
df.columns = ['first_name', 'age'] # Can use df.first_name
Quick Reference
Create
pd.Series([1, 2, 3])
pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
Access
df['col'] # Column
df.loc[label] # Row by label
df.iloc[position] # Row by position
Modify
df['new'] = values
df.drop('col', axis=1)
df.set_index('col')
df.reset_index()
Info
df.shape
df.dtypes
df.info()
df.describe()