Selecting/excluding sets of columns in pandas [duplicate]

Posted on

Question :

Selecting/excluding sets of columns in pandas [duplicate]

I would like to create views or dataframes from an existing dataframe based on column selections.

For example, I would like to create a dataframe df2 from a dataframe df1 that holds all columns from it except two of them. I tried doing the following, but it didn’t work:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# Try to create a second dataframe df2 from df with all columns except 'B' and D
my_cols = set(df.columns)
my_cols.remove('B').remove('D')

# This returns an error ("unhashable type: set")
df2 = df[my_cols]

What am I doing wrong? Perhaps more generally, what mechanisms does pandas have to support the picking and exclusions of arbitrary sets of columns from a dataframe?

Answer #1:

You can either Drop the columns you do not need OR Select the ones you need

# Using DataFrame.drop
df.drop(df.columns[[1, 2]], axis=1, inplace=True)

# drop by Name
df1 = df1.drop(['B', 'C'], axis=1)

# Select the ones you want
df1 = df[['a','d']]
Answered By: Amrita Sawant

Answer #2:

There is a new index method called difference. It returns the original columns, with the columns passed as argument removed.

Here, the result is used to remove columns B and D from df:

df2 = df[df.columns.difference(['B', 'D'])]

Note that it’s a set-based method, so duplicate column names will cause issues, and the column order may be changed.


Advantage over drop: you don’t create a copy of the entire dataframe when you only need the list of columns. For instance, in order to drop duplicates on a subset of columns:

# may create a copy of the dataframe
subset = df.drop(['B', 'D'], axis=1).columns

# does not create a copy the dataframe
subset = df.columns.difference(['B', 'D'])

df = df.drop_duplicates(subset=subset)
Answered By: IanS

Answer #3:

Another option, without dropping or filtering in a loop:

import numpy as np
import pandas as pd

# Create a dataframe with columns A,B,C and D
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

# include the columns you want
df[df.columns[df.columns.isin(['A', 'B'])]]

# or more simply include columns:
df[['A', 'B']]

# exclude columns you don't want
df[df.columns[~df.columns.isin(['C','D'])]]

# or even simpler since 0.24
# with the caveat that it reorders columns alphabetically 
df[df.columns.difference(['C', 'D'])]
Answered By: MrE

Answer #4:

You don’t really need to convert that into a set:

cols = [col for col in df.columns if col not in ['B', 'D']]
df2 = df[cols]
Answered By: piggybox

Answer #5:

Also have a look into the built-in DataFrame.filter function.

Minimalistic but greedy approach (sufficient for the given df):

df.filter(regex="[^BD]")

Conservative/lazy approach (exact matches only):

df.filter(regex="^(?!(B|D)$).*$")

Conservative and generic:

exclude_cols = ['B','C']
df.filter(regex="^(?!({0})$).*$".format('|'.join(exclude_cols)))
Answered By: Frank

Answer #6:

You just need to convert your set to a list

import pandas as pd
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
my_cols = set(df.columns)
my_cols.remove('B')
my_cols.remove('D')
my_cols = list(my_cols)
df2 = df[my_cols]
Answered By: tacaswell

Answer #7:

Here’s how to create a copy of a DataFrame excluding a list of columns:

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df2 = df.drop(['B', 'D'], axis=1)

But be careful! You mention views in your question, suggesting that if you changed df, you’d want df2 to change too. (Like a view would in a database.)

This method doesn’t achieve that:

>>> df.loc[0, 'A'] = 999 # Change the first value in df
>>> df.head(1)
     A         B         C         D
0  999 -0.742688 -1.980673 -0.920133
>>> df2.head(1) # df2 is unchanged. It's not a view, it's a copy!
          A         C
0  0.251262 -1.980673

Note also that this is also true of @piggybox’s method. (Although that method is nice and slick and Pythonic. I’m not doing it down!!)

For more on views vs. copies see this SO answer and this part of the Pandas docs which that answer refers to.

Answered By: LondonRob

Answer #8:

You have 4 columns A,B,C,D

Here is a better way to select the columns you need for the new dataframe:-

df2 = df1[['A','D']]

if you wish to use column numbers instead, use:-

df2 = df1[[0,3]]
Answered By: Kapil Marwaha

Leave a Reply

Your email address will not be published. Required fields are marked *