Solving problem is about exposing yourself to as many situations as possible like Filter dataframe rows if value in column is in a set list of values [duplicate] and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about Filter dataframe rows if value in column is in a set list of values [duplicate], which can be followed any time. Take easy to follow this discuss.
I have a Python pandas DataFrame rpt
:
rpt
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 47518 entries, ('000002', '20120331') to ('603366', '20091231')
Data columns:
STK_ID 47518 non-null values
STK_Name 47518 non-null values
RPT_Date 47518 non-null values
sales 47518 non-null values
I can filter the rows whose stock id is '600809'
like this: rpt[rpt['STK_ID'] == '600809']
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 25 entries, ('600809', '20120331') to ('600809', '20060331')
Data columns:
STK_ID 25 non-null values
STK_Name 25 non-null values
RPT_Date 25 non-null values
sales 25 non-null values
and I want to get all the rows of some stocks together, such as ['600809','600141','600329']
. That means I want a syntax like this:
stk_list = ['600809','600141','600329']
rst = rpt[rpt['STK_ID'] in stk_list] # this does not works in pandas
Since pandas not accept above command, how to achieve the target?
Answer #1:
Use the isin
method:
rpt[rpt['STK_ID'].isin(stk_list)]
Answer #2:
isin()
is ideal if you have a list of exact matches, but if you have a list of partial matches or substrings to look for, you can filter using the str.contains
method and regular expressions.
For example, if we want to return a DataFrame where all of the stock IDs which begin with '600'
and then are followed by any three digits:
>>> rpt[rpt['STK_ID'].str.contains(r'^600[0-9]{3}$')] # ^ means start of string
... STK_ID ... # [0-9]{3} means any three digits
... '600809' ... # $ means end of string
... '600141' ...
... '600329' ...
... ... ...
Suppose now we have a list of strings which we want the values in 'STK_ID'
to end with, e.g.
endstrings = ['01$', '02$', '05$']
We can join these strings with the regex ‘or’ character |
and pass the string to str.contains
to filter the DataFrame:
>>> rpt[rpt['STK_ID'].str.contains('|'.join(endstrings)]
... STK_ID ...
... '155905' ...
... '633101' ...
... '210302' ...
... ... ...
Finally, contains
can ignore case (by setting case=False
), allowing you to be more general when specifying the strings you want to match.
For example,
str.contains('pandas', case=False)
would match PANDAS
, PanDAs
, paNdAs123
, and so on.
Answer #3:
you can also use ranges by using:
b = df[(df['a'] > 1) & (df['a'] < 5)]
Answer #4:
You can also directly query your DataFrame for this information.
rpt.query('STK_ID in (600809,600141,600329)')
Or similarly search for ranges:
rpt.query('60000 < STK_ID < 70000')
Answer #5:
Slicing data with pandas
Given a dataframe like this:
RPT_Date STK_ID STK_Name sales
0 1980-01-01 0 Arthur 0
1 1980-01-02 1 Beate 4
2 1980-01-03 2 Cecil 2
3 1980-01-04 3 Dana 8
4 1980-01-05 4 Eric 4
5 1980-01-06 5 Fidel 5
6 1980-01-07 6 George 4
7 1980-01-08 7 Hans 7
8 1980-01-09 8 Ingrid 7
9 1980-01-10 9 Jones 4
There are multiple ways of selecting or slicing the data.
Using .isin
The most obvious is the .isin
feature. You can create a mask that gives you a series of True
/False
statements, which can be applied to a dataframe like this:
mask = df['STK_ID'].isin([4, 2, 6])
mask
0 False
1 False
2 True
3 False
4 True
5 False
6 True
7 False
8 False
9 False
Name: STK_ID, dtype: bool
df[mask]
RPT_Date STK_ID STK_Name sales
2 1980-01-03 2 Cecil 2
4 1980-01-05 4 Eric 4
6 1980-01-07 6 George 4
Masking is the ad-hoc solution to the problem, but does not always perform well in terms of speed and memory.
With indexing
By setting the index to the STK_ID
column, we can use the pandas builtin slicing object .loc
df.set_index('STK_ID', inplace=True)
RPT_Date STK_Name sales
STK_ID
0 1980-01-01 Arthur 0
1 1980-01-02 Beate 4
2 1980-01-03 Cecil 2
3 1980-01-04 Dana 8
4 1980-01-05 Eric 4
5 1980-01-06 Fidel 5
6 1980-01-07 George 4
7 1980-01-08 Hans 7
8 1980-01-09 Ingrid 7
9 1980-01-10 Jones 4
df.loc[[4, 2, 6]]
RPT_Date STK_Name sales
STK_ID
4 1980-01-05 Eric 4
2 1980-01-03 Cecil 2
6 1980-01-07 George 4
This is the fast way of doing it, even if the indexing can take a little while, it saves time if you want to do multiple queries like this.
Merging dataframes
This can also be done by merging dataframes. This would fit more for a scenario where you have a lot more data than in these examples.
stkid_df = pd.DataFrame({"STK_ID": [4,2,6]})
df.merge(stkid_df, on='STK_ID')
STK_ID RPT_Date STK_Name sales
0 2 1980-01-03 Cecil 2
1 4 1980-01-05 Eric 4
2 6 1980-01-07 George 4
Note
All the above methods work even if there are multiple rows with the same 'STK_ID'
Answer #6:
You can also achieve similar results by using ‘query’ and @:
eg:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
df = pd.DataFrame({'A' : [5,6,3,4], 'B' : [1,2,3, 5]})
list_of_values = [3,6]
result= df.query("A in @list_of_values")
result
A B
1 6 2
2 3 3
Answer #7:
You can use query
, i.e.:
b = df.query('a > 1 & a < 5')