Get the row(s) which have the max value in groups using groupby

Posted on

Solving problem is about exposing yourself to as many situations as possible like Get the row(s) which have the max value in groups using groupby and practice these strategies over and over. With time, it becomes second nature and a natural way you approach any problems in general. Big or small, always start with a plan, use other strategies mentioned here till you are confident and ready to code the solution.
In this post, my aim is to share an overview the topic about Get the row(s) which have the max value in groups using groupby, which can be followed any time. Take easy to follow this discuss.

Get the row(s) which have the max value in groups using groupby

How do I find all rows in a pandas data frame which have the max value for count column, after grouping by ['Sp','Mt'] columns?

Example 1: the following dataFrame, which I group by ['Sp','Mt']:

   Sp   Mt Value   count
0  MM1  S1   a     **3**
1  MM1  S1   n       2
2  MM1  S3   cb    **5**
3  MM2  S3   mk    **8**
4  MM2  S4   bg    **10**
5  MM2  S4   dgd     1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2   uyi   **7**

Expected output: get the result rows whose count is max between the groups, like:

0  MM1  S1   a      **3**
2  MM1  S3   cb     **5**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10**
8  MM4  S2   uyi    **7**

Example 2: this dataframe, which I group by ['Sp','Mt']:

   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8

For the above example, I want to get all the rows where count equals max, in each group e.g :

MM2  S4   bg     10
MM4  S2   cb     8
MM4  S2   uyi    8
Asked By: jojo12

||

Answer #1:

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7
In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count

To get the indices of the original DF you can do:

In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']
In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.

Update

On a hail mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)
In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
Answered By: Zelazny7

Answer #2:

You can sort the dataFrame by count and then remove duplicates. I think it’s easier:

df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
Answered By: Rani

Answer #3:

Easy solution would be to apply : idxmax() function to get indices of rows with max values.
This would filter out all the rows with max value in the group.

In [365]: import pandas as pd
In [366]: df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
In [367]: df
Out[367]:
   count  mt   sp  val
0      3  S1  MM1    a
1      2  S1  MM1    n
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
5      1  S4  MM2  dgb
6      2  S2  MM4   rd
7      2  S2  MM4   cb
8      7  S2  MM4  uyi
### Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
   count  mt   sp  val
0      3  S1  MM1    a
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
8      7  S2  MM4  uyi
### Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
Answered By: Surya

Answer #4:

Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows) I found it to be very slow. Here is an alternative that I found to run orders of magnitude faster on my data set.

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
    })
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
Answered By: landewednack

Answer #5:

You may not need to do with group by , using sort_values+ drop_duplicates

df.sort_values('count').drop_duplicates(['Sp','Mt'],keep='last')
Out[190]:
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

Also almost same logic by using tail

df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10
Answered By: BENY

Answer #6:

For me, the easiest solution would be keep value when count is equal to the maximum. Therefore, the following one line command is enough :

df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]
Answered By: PAC

Answer #7:

Use groupby and idxmax methods:

  1. transfer col date to datetime:

    df['date']=pd.to_datetime(df['date'])
    
  2. get the index of max of column date, after groupyby ad_id:

    idx=df.groupby(by='ad_id')['date'].idxmax()
    
  3. get the wanted data:

    df_max=df.loc[idx,]
    

Out[54]:

ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22
Answered By: blueear

Answer #8:

Try using “nlargest” on the groupby object. The advantage of using nlargest is that it returns the index of the rows where “the nlargest item(s)” were fetched from.
Note: we slice the second(1) element of our index since our index in this case consist of tuples(eg.(s1, 0)).

df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})
d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max
df.iloc[[i[1] for i in d.index], :] # pass the index of d as list comprehension

enter image description here

Answered By: escha

Leave a Reply

Your email address will not be published. Required fields are marked *