Selecting Columns

Overview

Column selection is one of the most common operations in pandas. There are several ways to select columns, each with different use cases and behaviors.

The main approaches (each demonstrated briefly after this list):

  • Bracket notation: df['col'] or df[['col1', 'col2']]
  • Dot notation: df.col
  • loc/iloc: df.loc[:, 'col'] or df.iloc[:, 0]
  • Column methods: filter(), select_dtypes()
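
Here is a quick taste of each approach on a small frame; every method gets its own section below.

A quick taste of each approach
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'city': ['NYC', 'LA']
})

df['age']                             # bracket notation -> Series
df[['name', 'age']]                   # list of names -> DataFrame
df.age                                # dot notation
df.loc[:, 'city']                     # label-based
df.iloc[:, 0]                         # position-based
df.filter(like='a')                   # names containing 'a'
df.select_dtypes(include=['number'])  # numeric columns only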

Single Column Selection

The most common way to select a single column:

Single column with brackets - returns Series
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NYC', 'LA', 'Chicago']
})

# Returns a Series
column = df['age']
type(column)  # pandas.core.series.Series

print(column)
# 0    25
# 1    30
# 2    35
# Name: age, dtype: int64

A single bracket returns a Series, which is a 1D labeled array.

Dot Notation (Convenient but Limited)

Access columns as attributes:

Dot notation - convenient for simple names
# Same as df['age']
df.age

# Works for exploration and quick access
df.name
df.city

Limitations of dot notation:

When dot notation doesn't work
# Column names with spaces
df['first name']   # Works
df.first name      # SyntaxError!

# Column names that match DataFrame methods
df['count']        # Works
df.count           # Returns a method, not the column!

# Column names starting with numbers
df['2024_sales']   # Works
df.2024_sales      # SyntaxError!

# Setting new columns
df['new_col'] = 10 # Works
df.new_col = 10 # Creates an attribute, not a column!

Use dot notation for quick exploration, but stick to brackets for robust code.
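
If a column might be absent, DataFrame.get() returns a default instead of raising a KeyError; a small sketch using the frame above:

Safe access with get()
df.get('age')         # Series, same as df['age']
df.get('missing')     # None instead of a KeyError
df.get('missing', 0)  # custom fallback value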

With loc

Select a column using loc (returns Series):

Single column with loc
# All rows, single column
df.loc[:, 'age']

# This is equivalent to df['age']
# Use loc when you're also filtering rows:
df.loc[df['age'] > 28, 'name']

Multiple Column Selection

Double Brackets (Returns DataFrame)

Pass a list of column names to get a DataFrame:

Multiple columns - returns DataFrame
# Select two columns
df[['name', 'city']]
#       name     city
# 0    Alice      NYC
# 1      Bob       LA
# 2  Charlie  Chicago

type(df[['name', 'city']]) # pandas.core.frame.DataFrame

# Even a single column in double brackets returns DataFrame
df[['name']] # DataFrame with one column
df['name'] # Series

The double bracket returns a DataFrame, which maintains the 2D structure.
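
The difference is easy to see in .shape:

Series vs DataFrame shape
df['name'].shape    # (3,)   -> one-dimensional Series
df[['name']].shape  # (3, 1) -> two-dimensional DataFrame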

Column Order

The order in the list determines the column order in the result:

Reordering columns during selection
# Original order: name, age, city
df[['city', 'name', 'age']]
#       city     name  age
# 0      NYC    Alice   25
# 1       LA      Bob   30
# 2  Chicago  Charlie   35

# Useful for rearranging columns
df = df[['age', 'name', 'city']] # Reorder permanently

With loc

Select multiple columns using loc:

Multiple columns with loc
# All rows, specific columns
df.loc[:, ['name', 'city']]

# Combined with row filtering
df.loc[df['age'] > 28, ['name', 'city']]
#       name     city
# 1      Bob       LA
# 2  Charlie  Chicago

Column Slicing

Slice by Label

Select a range of columns using slices:

Column slicing with loc (inclusive)
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9],
    'D': [10, 11, 12]
})

# Select columns from B to D (inclusive)
df.loc[:, 'B':'D']
#    B  C   D
# 0  4  7  10
# 1  5  8  11
# 2  6  9  12

# Note: This is inclusive on both ends
# df.loc[:, 'B':'D'] includes B, C, AND D

Remember: loc slicing is inclusive, unlike Python's standard slicing.
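
To see the two conventions side by side, both of these select columns B, C, and D from the frame above:

Label slice vs position slice
df.loc[:, 'B':'D']   # labels: endpoint 'D' is included
df.iloc[:, 1:4]      # positions: endpoint 4 is excluded
# Both return columns B, C, D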

Slice by Position

Use iloc to slice by column position:

Column slicing with iloc (exclusive)
# First two columns (positions 0 and 1)
df.iloc[:, 0:2]
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6

# Last two columns
df.iloc[:, -2:]

# Every other column
df.iloc[:, ::2]
#    A  C
# 0  1  7
# 1  2  8
# 2  3  9

# All except first column
df.iloc[:, 1:]

This follows Python's standard slicing rules (exclusive endpoint).

Column Selection by Type

select_dtypes()

Select columns based on their data type:

Select columns by data type
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'salary': [50000.0, 60000.0],
    'hired': pd.to_datetime(['2020-01-15', '2019-03-20']),
    'active': [True, False]
})

# Select numeric columns
df.select_dtypes(include=['number'])
#    age   salary
# 0   25  50000.0
# 1   30  60000.0

# Select only integers
df.select_dtypes(include=['int64'])
#    age
# 0   25
# 1   30

# Select only floats
df.select_dtypes(include=['float64'])
#     salary
# 0  50000.0
# 1  60000.0

# Select string columns
df.select_dtypes(include=['object'])
#     name
# 0  Alice
# 1    Bob

# Select datetime columns
df.select_dtypes(include=['datetime64'])
#        hired
# 0 2020-01-15
# 1 2019-03-20

# Select boolean columns
df.select_dtypes(include=['bool'])
#    active
# 0    True
# 1   False

This is extremely useful when you want to apply operations only to specific types.
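
For instance, summary statistics restricted to just the numeric columns of the frame above:

Operate on numeric columns only
df.select_dtypes(include=['number']).mean()
# age          27.5
# salary    55000.0
# dtype: float64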

Exclude Data Types

Exclude columns by data type
# Everything except numbers
df.select_dtypes(exclude=['number'])
#     name      hired  active
# 0  Alice 2020-01-15    True
# 1    Bob 2019-03-20   False

# Everything except strings
df.select_dtypes(exclude=['object'])
#    age   salary      hired  active
# 0   25  50000.0 2020-01-15    True
# 1   30  60000.0 2019-03-20   False

Multiple Types

Select multiple data types
# Select both integers and floats
df.select_dtypes(include=['int64', 'float64'])
#    age   salary
# 0   25  50000.0
# 1   30  60000.0

# Or use general category
df.select_dtypes(include=['number']) # Same result

Common type categories (the 'category' dtype is demonstrated after this list):

  • 'number': All numeric types (int, float)
  • 'object': String/mixed types
  • 'datetime64': Datetime types
  • 'bool': Boolean
  • 'category': Categorical
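
The frame above has no categorical column; here is a minimal sketch of the 'category' case (cat_df is a made-up example):

Selecting categorical columns
cat_df = pd.DataFrame({
    'grade': pd.Categorical(['A', 'B', 'A']),
    'score': [90, 80, 95]
})
cat_df.select_dtypes(include=['category'])
#   grade
# 0     A
# 1     B
# 2     A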

Column Selection by Pattern

filter() Method

Select columns using pattern matching:

Filter columns by name pattern
df = pd.DataFrame({
    'sales_2023': [100, 200],
    'sales_2024': [150, 250],
    'cost_2023': [50, 80],
    'cost_2024': [60, 90],
    'profit': [50, 120]
})

# Columns containing 'sales'
df.filter(like='sales')
#    sales_2023  sales_2024
# 0         100         150
# 1         200         250

# Columns starting with 'cost'
df.filter(regex='^cost')
#    cost_2023  cost_2024
# 0         50         60
# 1         80         90

# Columns ending with '2024'
df.filter(regex='2024$')
#    sales_2024  cost_2024
# 0         150         60
# 1         250         90

# Exact match (using the items parameter)
df.filter(items=['sales_2023', 'profit'])
#    sales_2023  profit
# 0         100      50
# 1         200     120

The regex parameter is powerful for complex patterns.
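
For instance, anchors and alternation combine to express "starts with sales or cost AND ends with 2023" against the frame above:

Combining regex features
df.filter(regex='^(sales|cost)_2023$')
#    sales_2023  cost_2023
# 0         100         50
# 1         200         80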

List Comprehension

More flexible pattern matching:

Using list comprehension for column selection
# Columns containing '2023'
cols_2023 = [col for col in df.columns if '2023' in col]
df[cols_2023]
#    sales_2023  cost_2023
# 0         100         50
# 1         200         80

# Columns starting with 's'
s_cols = [col for col in df.columns if col.startswith('s')]
df[s_cols]

# Columns with more than 10 characters
long_cols = [col for col in df.columns if len(col) > 10]
df[long_cols]

# Custom logic
numeric_name_cols = [col for col in df.columns if any(char.isdigit() for char in col)]
df[numeric_name_cols]

Conditional Column Selection

Select Based on Column Content

Choose columns based on their values:

Select columns with specific characteristics
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [0, 0, 0, 0, 0],
    'C': [10, 20, 30, 40, 50],
    'D': [1, 1, 1, 1, 1]
})

# Columns with any non-zero values
non_zero_cols = df.columns[(df != 0).any()]
df[non_zero_cols]
#    A   C  D
# 0  1  10  1
# 1  2  20  1
# 2  3  30  1
# 3  4  40  1
# 4  5  50  1

# Columns with variance > 0 (non-constant)
variable_cols = df.columns[df.var() > 0]
df[variable_cols]
#    A   C
# 0  1  10
# 1  2  20
# 2  3  30
# 3  4  40
# 4  5  50

# Columns with missing values
null_cols = df.columns[df.isnull().any()]
df[null_cols]

# Columns with all unique values
unique_cols = df.columns[df.nunique() == len(df)]
df[unique_cols]

This helps identify and select meaningful columns automatically.
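
These checks combine naturally into a small reusable helper; a sketch (drop_uninformative is our own name, not a pandas API):

A reusable cleanup helper
def drop_uninformative(df):
    """Keep only columns that vary and are not entirely null."""
    keep = df.columns[(df.nunique() > 1) & df.notna().any()]
    return df[keep]

drop_uninformative(df)  # Drops constant B and D, keeps A and C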

Select Based on Statistics

Select columns by statistical properties
# Columns with mean > 10
high_mean_cols = df.columns[df.mean() > 10]
df[high_mean_cols]

# Columns with max value > 30
high_max_cols = df.columns[df.max() > 30]
df[high_max_cols]

# Columns with standard deviation > 5
high_std_cols = df.columns[df.std() > 5]
df[high_std_cols]

Dropping Columns

drop() Method

Remove columns by name:

Dropping columns
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Drop single column (returns a new DataFrame)
df.drop('B', axis=1)
#    A  C
# 0  1  7
# 1  2  8
# 2  3  9

# Drop multiple columns
df.drop(['A', 'C'], axis=1)
#    B
# 0  4
# 1  5
# 2  6

# Using the columns parameter (clearer)
df.drop(columns=['B', 'C'])

# Drop in place (modifies the original; df no longer has 'B' afterwards)
df.drop('B', axis=1, inplace=True)

Selecting What to Keep

Sometimes it's easier to select what you want to keep:

Keep vs drop approach
df = pd.DataFrame({
    'A': [1, 2], 'B': [3, 4], 'C': [5, 6],
    'D': [7, 8], 'E': [9, 10]
})

# If you want to drop many columns, keeping is easier
# Drop A, B, C, D (keep only E)
df.drop(['A', 'B', 'C', 'D'], axis=1) # Verbose

# Better: just select E
df[['E']] # Clearer intent

# Keep A and E, drop the rest
df[['A', 'E']]

Reordering Columns

Explicit Order

Reorder columns by listing them in desired order:

Reordering columns
df = pd.DataFrame({
    'age': [25, 30],
    'name': ['Alice', 'Bob'],
    'city': ['NYC', 'LA']
})

# Reorder to: name, age, city
df = df[['name', 'age', 'city']]
#     name  age city
# 0  Alice   25  NYC
# 1    Bob   30   LA

Move Specific Column to Front

Move column to beginning
# Move 'age' to first position
cols = ['age'] + [col for col in df.columns if col != 'age']
df = df[cols]

# Or using a function
def move_column_to_front(df, col_name):
    cols = [col_name] + [col for col in df.columns if col != col_name]
    return df[cols]

df = move_column_to_front(df, 'city')

Sort Columns Alphabetically

Sort columns by name
# Alphabetical order
df = df[sorted(df.columns)]

# Reverse alphabetical
df = df[sorted(df.columns, reverse=True)]

Advanced Column Selection

Dynamic Column Selection

Select columns programmatically:

Dynamic column selection based on logic
df = pd.DataFrame({
    'id': [1, 2, 3],
    'user_name': ['Alice', 'Bob', 'Charlie'],
    'user_age': [25, 30, 35],
    'user_city': ['NYC', 'LA', 'Chicago'],
    'admin_role': ['yes', 'no', 'yes']
})

# All columns starting with 'user_'
user_cols = [col for col in df.columns if col.startswith('user_')]
df[user_cols]
#   user_name  user_age user_city
# 0     Alice        25       NYC
# 1       Bob        30        LA
# 2   Charlie        35   Chicago

# All columns except 'id'
non_id_cols = [col for col in df.columns if col != 'id']
df[non_id_cols]

# Columns with specific keywords
keywords = ['name', 'age']
relevant_cols = [col for col in df.columns if any(kw in col for kw in keywords)]
df[relevant_cols]

Column Selection with Index

Sometimes you need to work with column positions:

Column selection by index position
# Get column names by position
first_col = df.columns[0]
last_col = df.columns[-1]

# Select first 3 columns by position
first_three = df[df.columns[:3]]

# Select every other column
every_other = df[df.columns[::2]]

# Select specific positions
positions = [0, 2, 4]
selected = df[df.columns[positions]]

Combining Selection Methods

Filter Rows and Select Columns

Combine boolean filtering with column selection:

Combined row and column selection
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'Chicago'],
    'salary': [50000, 60000, 70000, 55000]
})

# Method 1: Using loc (recommended)
df.loc[df['age'] > 28, ['name', 'salary']]
#       name  salary
# 1      Bob   60000
# 2  Charlie   70000

# Method 2: Filter then select
filtered = df[df['age'] > 28]
result = filtered[['name', 'salary']]

# Method 3: Chain operations
result = df[df['age'] > 28][['name', 'salary']]

The loc approach is generally cleaner and more efficient.
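
Whether the difference matters depends on your data; here is a minimal timeit sketch you can adapt (the frame is random and the sizes arbitrary):

Timing the two approaches
import timeit

setup = (
    "import pandas as pd; import numpy as np; "
    "df = pd.DataFrame(np.random.rand(100_000, 2), columns=['age', 'salary'])"
)
timeit.timeit("df.loc[df['age'] > 0.5, ['salary']]", setup=setup, number=100)
timeit.timeit("df[df['age'] > 0.5][['salary']]", setup=setup, number=100)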

Type-Based Operations

Operations on specific column types
df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30],
    'salary': [50000, 60000],
    'bonus': [5000, 6000]
})

# Apply operation only to numeric columns
numeric_cols = df.select_dtypes(include=['number'])
df[numeric_cols.columns] = numeric_cols * 1.1 # 10% increase

# Or, more directly (an alternative to the two lines above, not a follow-up step)
df[df.select_dtypes(include=['number']).columns] *= 1.1

Common Patterns

Exclude Certain Columns

Exclude specific columns
df = pd.DataFrame({
    'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]
})

# Keep everything except B and D
exclude = ['B', 'D']
keep_cols = [col for col in df.columns if col not in exclude]
df[keep_cols]
#    A  C
# 0  1  5
# 1  2  6

# Or use Index.difference (note: the result comes back sorted, not in original order)
keep_cols = df.columns.difference(exclude)
df[keep_cols]

Select Columns Matching Multiple Patterns

OR pattern matching
df = pd.DataFrame({
    'sales_q1': [100], 'sales_q2': [200],
    'cost_q1': [50], 'cost_q2': [80],
    'profit_q1': [50], 'profit_q2': [120]
})

# Columns containing 'sales' OR 'profit'
cols = [col for col in df.columns if 'sales' in col or 'profit' in col]
df[cols]
#    sales_q1  sales_q2  profit_q1  profit_q2
# 0       100       200         50        120

# Using regex with filter
df.filter(regex='sales|profit') # Same result

Rename While Selecting

Select and rename simultaneously
# Using rename with column selection
df = df[['old_name1', 'old_name2']].rename(columns={
    'old_name1': 'new_name1',
    'old_name2': 'new_name2'
})

# Or select then rename
selected = df[['A', 'B']]
selected.columns = ['Column_A', 'Column_B']

Performance Tips

Column Access Performance

Performance considerations
# For single value access in a column
# Fast: direct column access
df['col'].iloc[0]

# Slower: double indexing
df[['col']].iloc[0, 0]

# For repeated access, cache column reference
# Slow inside a loop: repeated column lookup each iteration
for i in range(len(df)):
    value = df['expensive_column'][i]

# Better: cache the column reference once
col = df['expensive_column']
for i in range(len(col)):
    value = col[i]

# Best: avoid loops, use vectorized operations
values = df['expensive_column'].values # NumPy array

Minimize Column Selection Operations

Efficient column selection
# Less efficient: Multiple selections
temp1 = df[['A', 'B']]
temp2 = temp1[temp1['A'] > 5]
result = temp2[['B']]

# More efficient: Single operation
result = df.loc[df['A'] > 5, ['B']]

Common Mistakes

Single vs Double Brackets

Bracket confusion
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Single bracket - returns Series
s = df['A']
type(s) # Series

# Double bracket - returns DataFrame
df_single = df[['A']]
type(df_single) # DataFrame

# This matters for method chaining
df['A'].mean() # Works - Series has mean()
df[['A']].mean() # Also works - DataFrame has mean()

# But output differs
df['A'].mean() # Returns scalar: 1.5
df[['A']].mean() # Returns Series with one value: A 1.5
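
To convert between the two shapes explicitly, squeeze() and to_frame() go in each direction:

Converting between Series and DataFrame
df[['A']].squeeze()   # one-column DataFrame -> Series
df['A'].to_frame()    # Series -> one-column DataFrame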

Dot Notation Pitfalls

Dot notation problems
# This doesn't create a column!
df.new_col = [1, 2, 3] # Creates DataFrame attribute, not column

# This does:
df['new_col'] = [1, 2, 3]

# Can't use dot notation for assignment of new columns
# Always use brackets for setting columns

Modifying Copies vs Views

Copy vs view issue
# This might be a view or a copy
subset = df[['A', 'B']]
subset['A'] = 999 # Might trigger SettingWithCopyWarning

# Be explicit
subset = df[['A', 'B']].copy()
subset['A'] = 999 # Safe, won't affect df

Quick Reference

Single column (returns Series):

df['col']
df.col # Dot notation (limited)
df.loc[:, 'col']
df.iloc[:, 0]

Multiple columns (returns DataFrame):

df[['col1', 'col2']]
df.loc[:, ['col1', 'col2']]
df.iloc[:, [0, 1]]

By type:

df.select_dtypes(include=['number'])
df.select_dtypes(exclude=['object'])

By pattern:

df.filter(like='sales')
df.filter(regex='^cost_')

Reorder:

df[['C', 'A', 'B']]  # Explicit order
df[sorted(df.columns)] # Alphabetical

Drop:

df.drop('col', axis=1)
df.drop(columns=['col1', 'col2'])

Best practices:

  • Use [] brackets for robust code (not dot notation)
  • Use [[]] double brackets to get DataFrame (not Series)
  • Use loc when combining row filtering with column selection
  • Use select_dtypes() for type-based operations
  • Always use .copy() when modifying a subset