# Efficiently select rows that match one of several values in Pandas DataFrame

Posted on

### Question :

Efficiently select rows that match one of several values in Pandas DataFrame

## Problem

Given data in a Pandas DataFrame like the following:

``````Name     Amount
---------------
Alice       100
Bob          50
Charlie     200
Alice        30
Charlie      10
``````

I want to select all rows where the `Name` is one of several values in a collection `{Alice, Bob}`

``````Name     Amount
---------------
Alice       100
Bob          50
Alice        30
``````

## Question

What is an efficient way to do this in Pandas?

Options as I see them

1. Loop through rows, handling the logic with Python
2. Select and merge many statements like the following

``````merge(df[df.name = specific_name] for specific_name in names) # something like this
``````
3. Perform some sort of join

What are the performance trade-offs here? When is one solution better than the others? What solutions am I missing?

While the example above uses strings my actual job uses matches on 10-100 integers over millions of rows and so fast NumPy operations may be relevant.

You can use the isin Series method:

``````In : df['Name'].isin(['Alice', 'Bob'])
Out:
0     True
1     True
2    False
3     True
4    False
Name: Name, dtype: bool

In : df[df.Name.isin(['Alice', 'Bob'])]
Out:
Name  Amount
0  Alice     100
1    Bob      50
3  Alice      30
``````

Since, in your actual use case, the values in `df['Name']` are `ints`, you might be able to generate the boolean mask faster using NumPy indexing instead of `Series.isin`.

``````idx = np.zeros(N, dtype='bool')
idx[names] = True
df[idx[df['Name'].values]]
``````

For example, given this setup:

``````import pandas as pd
import numpy as np

N = 100000
df = pd.DataFrame(np.random.randint(N, size=(10**6, 2)), columns=['Name', 'Amount'])
names = np.random.choice(np.arange(N), size=100, replace=False)
``````

``````In : %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
100 loops, best of 3: 9.88 ms per loop

In : %timeit df[df.Name.isin(names)]
10 loops, best of 3: 107 ms per loop

In : 107/9.88
Out: 10.82995951417004
``````

`N` is (essentially) the maximum value that `df['Names']` can attain.
If `N` is smaller, the speed benefit is not as large. With `N = 200`,

``````In : %timeit idx = np.zeros(N, dtype='bool'); idx[names] = True; df[idx[df['Name'].values]]
10 loops, best of 3: 62.6 ms per loop

In : %timeit df[df.Name.isin(names)]
10 loops, best of 3: 178 ms per loop

In : 178/62.6
Out: 2.8434504792332267
``````

Caution: As shown above, there seems to be a speed benefit, particularly as `N` gets large. However, if `N` is too large, then forming `idx = np.zeros(N, dtype='bool')` may not be feasible.

Sanity check:

``````expected = df[df.Name.isin(names)]
idx = np.zeros(N, dtype='bool')
idx[names] = True
result = df[idx[df['Name'].values]]
assert expected.equals(result)
``````