# Get the row(s) which have the max value in groups using groupby

This post walks through a common pandas question: how to get the row(s) that hold the maximum value within each group when using `groupby`. The question comes first, followed by several answers.

How do I find all rows in a pandas DataFrame which have the max value for the `count` column, after grouping by the `['Sp','Mt']` columns?

Example 1: the following DataFrame, which I group by `['Sp','Mt']`:

``````
   Sp   Mt Value   count
0  MM1  S1   a     **3**
1  MM1  S1   n       2
2  MM1  S3   cb    **5**
3  MM2  S3   mk    **8**
4  MM2  S4   bg    **10**
5  MM2  S4   dgd     1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2   uyi   **7**
``````

Expected output: the rows whose `count` is the maximum within each group, like:

``````
0  MM1  S1   a      **3**
2  MM1  S3   cb     **5**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10**
8  MM4  S2   uyi    **7**
``````

Example 2: this dataframe, which I group by `['Sp','Mt']`:

``````
   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8
``````

For the above example, I want to get all the rows where `count` equals the maximum in each group, e.g.:

``````
MM2  S4   bg     10
MM4  S2   cb     8
MM4  S2   uyi    8
``````

``````
In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count
``````

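To select the corresponding rows of the original DataFrame, compare each row's `count` with its group maximum (note that these examples group by `Mt` only; pass `['Sp', 'Mt']` to group as in the question):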

``````
In [3]: idx = df.groupby(['Mt'])['count'].transform('max') == df['count']
In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7
``````

Note that if you have multiple max values per group, all will be returned.
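
As a minimal, self-contained sketch of this mask approach, the snippet below rebuilds Example 2 from the question and groups by both `Sp` and `Mt`. The helper name `rows_with_group_max` and the variable `df2` are made up for illustration; they are not part of pandas or of the original answer.

```python
import pandas as pd


def rows_with_group_max(df, keys, value):
    """Return every row whose `value` equals the maximum within its `keys` group."""
    # transform("max") broadcasts each group's maximum back onto the original rows,
    # so the comparison yields a boolean mask aligned with df's index.
    group_max = df.groupby(keys)[value].transform("max")
    return df[df[value] == group_max]


# Example 2 from the question, with its original index labels.
df2 = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

result = rows_with_group_max(df2, ["Sp", "Mt"], "count")
print(result)  # rows 4, 7 and 8: the tie in (MM4, S2) is kept
```

Passing the string `"max"` rather than the builtin `max` also avoids the deprecation warning that recent pandas versions emit for builtin callables.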

Update

On a hail mary chance that this is what the OP is requesting:

``````
In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform('max')
In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
``````

You can sort the DataFrame by `count` and then remove duplicates, which I think is easier:

``````
df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
``````
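
One thing to check before relying on this shortcut: `drop_duplicates` keeps a single row per `(Sp, Mt)` pair, so tied maxima collapse to one row. A small sketch on the Example 2 data from the question (the name `df2` is an assumption) illustrates this:

```python
import pandas as pd

# Example 2 from the question, with its original index labels.
df2 = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# Largest counts first, then keep the first row seen for each (Sp, Mt) pair.
top_one = df2.sort_values("count", ascending=False).drop_duplicates(["Sp", "Mt"])

# (MM4, S2) has two rows with count 8, but only one of them survives.
print(top_one)
print(len(top_one))  # 2 rows, one per group
```

Which of the tied rows survives depends on the sort order, so this route fits best when any one maximal row per group is acceptable.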

A simple solution is to use `idxmax()` to get the index label of the row with the maximum value in each group, then select those rows with `.loc`.
Note that `idxmax()` returns only the first occurrence, so a group with tied maxima yields a single row.

``````
In [365]: import pandas as pd
In [366]: df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
})
In [367]: df
Out[367]:
   count  mt   sp  val
0      3  S1  MM1    a
1      2  S1  MM1    n
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
5      1  S4  MM2  dgb
6      2  S2  MM4   rd
7      2  S2  MM4   cb
8      7  S2  MM4  uyi

# Apply idxmax() and use .loc[] on the DataFrame to select the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
   count  mt   sp  val
0      3  S1  MM1    a
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
8      7  S2  MM4  uyi

# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
``````
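
Since `idxmax()` returns index labels rather than positions, `.loc` is the accessor to pair it with; this matters once the index is no longer the default 0..n-1. The sketch below uses the Example 2 rows from the question, whose labels start at 4 (the name `df2` is an assumption):

```python
import pandas as pd

# Example 2 from the question; note the index labels run from 4 to 8.
df2 = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# One label per group: the first row that reaches the group's maximum.
idx = df2.groupby(["Sp", "Mt"])["count"].idxmax()
print(idx.tolist())  # labels, not positions

# .loc looks rows up by label, so it works with any index.
first_max = df2.loc[idx]
print(first_max)
```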

Having tried the solution suggested by Zelazny on a relatively large DataFrame (~400k rows), I found it to be very slow. Here is an alternative that ran orders of magnitude faster on my data set.

``````
import pandas as pd

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
})

# Compute the maximum count per (sp, mt) group as a small lookup table
df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})
df_grouped = df_grouped.reset_index()
df_grouped = df_grouped.rename(columns={'count':'count_max'})

# Attach each group's maximum to its rows, then keep the rows that reach it
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])
df = df[df['count'] == df['count_max']]
``````
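
To try this against a `transform`-based mask on your own data, a sketch like the following checks that both routes select the same rows. It makes no claim about speed; the names `group_max`, `via_merge` and `via_transform` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "sp": ["MM1", "MM1", "MM1", "MM2", "MM2", "MM2", "MM4", "MM4", "MM4"],
    "mt": ["S1", "S1", "S3", "S3", "S4", "S4", "S2", "S2", "S2"],
    "val": ["a", "n", "cb", "mk", "bg", "dgb", "rd", "cb", "uyi"],
    "count": [3, 2, 5, 8, 10, 1, 2, 2, 7],
})

# Route 1: per-group maxima as a lookup table, merged back onto the rows.
group_max = (
    df.groupby(["sp", "mt"], as_index=False)["count"].max()
      .rename(columns={"count": "count_max"})
)
merged = df.merge(group_max, how="left", on=["sp", "mt"])
via_merge = merged[merged["count"] == merged["count_max"]].drop(columns="count_max")

# Route 2: boolean mask built with transform("max").
via_transform = df[df["count"] == df.groupby(["sp", "mt"])["count"].transform("max")]

print(via_merge)
print(via_transform)
```

Wrapping each route in `timeit` on a realistically sized frame is a reasonable way to see which one suits a given data set.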

You may not need `groupby` at all; `sort_values` plus `drop_duplicates` is enough:

``````
df.sort_values('count').drop_duplicates(['Sp','Mt'],keep='last')
Out[190]:
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10
``````

Almost the same logic works with `tail`:

``````
df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10
``````

For me, the easiest solution is to keep the rows whose `count` equals the group maximum. The following one-liner is enough:

``````
df[df['count'] == df.groupby(['Mt'])['count'].transform('max')]
``````

Use the `groupby` and `idxmax` methods:

1. Convert the `date` column to `datetime`:

``````
df['date'] = pd.to_datetime(df['date'])
``````
2. Get the index of the maximum `date` within each `ad_id` group:

``````
idx = df.groupby(by='ad_id')['date'].idxmax()
``````
3. Select the wanted rows:

``````
df_max = df.loc[idx]
``````

Out[54]:

``````
   ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22
``````
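
The input for this answer is not shown, so the following sketch runs the same three steps on an invented frame. The rows in `ads` are hypothetical and are not the data behind the output above.

```python
import pandas as pd

# Hypothetical input: these rows are invented for illustration only.
ads = pd.DataFrame({
    "ad_id": [22, 22, 23, 23, 24],
    "price": [1, 2, 3, 2, 2],
    "date": ["2018-06-01", "2018-06-11", "2018-06-02", "2018-06-22", "2018-06-30"],
})

# Step 1: parse the date strings so they compare as dates rather than text.
ads["date"] = pd.to_datetime(ads["date"])

# Step 2: label of the row with the latest date within each ad_id.
idx = ads.groupby("ad_id")["date"].idxmax()

# Step 3: pull those rows out by label.
latest = ads.loc[idx]
print(latest)
```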

Try using `nlargest` on the groupby object. Its result is indexed by the group key together with the original row label, so it tells you which rows the largest value(s) were fetched from.
Note: we take the second element (position 1) of each index entry, since the index in this case consists of tuples such as `('S1', 0)`.

``````
import pandas as pd

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
})
d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max
# each entry of d.index is (group key, original row label); select rows by that label
df.loc[[i[1] for i in d.index], :]
``````
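
If tied rows should all be returned, `nlargest` accepts `keep='all'`, and the original row labels can be read from the last level of the resulting index instead of unpacking tuples. A hedged sketch on the Example 2 data from the question (the name `df2` is an assumption):

```python
import pandas as pd

# Example 2 from the question, with its original index labels.
df2 = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# The result is indexed by (Sp, Mt, original row label); keep="all" retains ties.
top = df2.groupby(["Sp", "Mt"])["count"].nlargest(1, keep="all")

# The last index level holds the labels of the rows the values came from.
labels = top.index.get_level_values(-1)
result = df2.loc[labels]
print(result)
```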
