Select multiple columns by labels in pandas

Posted on

Question :

Select multiple columns by labels in pandas

I’ve been looking around for ways to select columns through the python documentation and the forums but every example on indexing columns are too simplistic.

Suppose I have a 10 x 10 dataframe

df = DataFrame(randn(10, 10), index=range(0,10), columns=['A', 'B', 'C', 'D','E','F','G','H','I','J'])

So far, all the documentations gives is just a simple example of indexing like

subset = df.loc[:,'A':'C']

or

subset = df.loc[:,'C':]

But I get an error when I try index multiple, non-sequential columns, like this

subset = df.loc[:,('A':'C', 'E')]

How would I index in Pandas if I wanted to select column A to C, E, and G to I? It appears that this logic will not work

subset = df.loc[:,('A':'C', 'E', 'G':'I')]

I feel that the solution is pretty simple, but I can’t get around this error. Thanks!

Asked By: Minh Mai

||

Answer #1:

Name- or Label-Based (using regular expression syntax)

df.filter(regex='[A-CEG-I]')   # does NOT depend on the column order

Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase “A” you could use: df.filter(regex='^[Aa]')

Location-Based (depends on column order)

df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]

Note that unlike the label-based method, this only works if your columns are alphabetically sorted. This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.

The Long Way

And for completeness, you always have the option shown by @Magdalena of simply listing each column individually, although it could be much more verbose as the number of columns increases:

df[['A','B','C','E','G','H','I']]   # does NOT depend on the column order

Results for any of the above methods

          A         B         C         E         G         H         I
0 -0.814688 -1.060864 -0.008088  2.697203 -0.763874  1.793213 -0.019520
1  0.549824  0.269340  0.405570 -0.406695 -0.536304 -1.231051  0.058018
2  0.879230 -0.666814  1.305835  0.167621 -1.100355  0.391133  0.317467
Answered By: JohnE

Answer #2:

Just pick the columns you want directly….

df[['A','E','I','C']]
Answered By: Magdalena

Answer #3:

How do I select multiple columns by labels in pandas?

Multiple label-based range slicing is not easily supported with pandas, but position-based slicing is, so let’s try that instead:

loc = df.columns.get_loc
df.iloc[:, np.r_[loc('A'):loc('C')+1, loc('E'), loc('G'):loc('I')+1]]

          A         B         C         E         G         H         I
0 -1.666330  0.321260 -1.768185 -0.034774  0.023294  0.533451 -0.241990
1  0.911498  3.408758  0.419618 -0.462590  0.739092  1.103940  0.116119
2  1.243001 -0.867370  1.058194  0.314196  0.887469  0.471137 -1.361059
3 -0.525165  0.676371  0.325831 -1.152202  0.606079  1.002880  2.032663
4  0.706609 -0.424726  0.308808  1.994626  0.626522 -0.033057  1.725315
5  0.879802 -1.961398  0.131694 -0.931951 -0.242822 -1.056038  0.550346
6  0.199072  0.969283  0.347008 -2.611489  0.282920 -0.334618  0.243583
7  1.234059  1.000687  0.863572  0.412544  0.569687 -0.684413 -0.357968
8 -0.299185  0.566009 -0.859453 -0.564557 -0.562524  0.233489 -0.039145
9  0.937637 -2.171174 -1.940916 -1.553634  0.619965 -0.664284 -0.151388

Note that the +1 is added because when using iloc the rightmost index is exclusive.


Comments on Other Solutions

  • filter is a nice and simple method for OP’s headers, but this might not generalise well to arbitrary column names.

  • The “location-based” solution with loc is a little closer to the ideal, but you cannot avoid creating intermediate DataFrames (that are eventually thrown out and garbage collected) to compute the final result range — something that we would ideally like to avoid.

  • Lastly, “pick your columns directly” is good advice as long as you have a manageably small number of columns to pick. It will, however not be applicable in some cases where ranges span dozens (or possibly hundreds) of columns.

Answered By: cs95

Leave a Reply

Your email address will not be published. Required fields are marked *