# How to summarize on different groupby combinations?

Posted on

### Question :

How to summarize on different groupby combinations?

I am compiling a table of top-3 crops by county. Some counties have the same crop varieties in the same order. Other counties have the same crop varieties in a different order.

``````df1 = pd.DataFrame( {
"County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] ,
"Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
"Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
"Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )
``````

I can do a groupby on Crop1, Crop2 and Crop3 and get the sum of total_pop:

``````df1_grouped=df1.groupby(['Crop1',"Crop2","Crop3"])['Total_pop'].sum().reset_index()
``````

That gives me the total for specific combinations of the crops:

``````df1_grouped
apples  melons  grain   1500
grain   melons  apples  2000
melons  grain   apples  4500
``````

What I would like, though, is to get the total population on different combinations of crops — irrespective of whether the listed crop was crop1, crop2, or crop3. The desired result would be this:

``````apples  melons   grain    8000
``````

Thank you for any guidance.

Method 1:

Combine the `crop` columns

``````>>> df1['combined_temp'] = df1.apply(lambda x : list([x['Crop1'],
...                           x['Crop2'],
...                           x['Crop3']]),axis=1)
County   Crop1    Crop2    Crop3  Total_pop              combined_temp
0      Harney   grain   melons   apples       2000    [grain, melons, apples]
1       Baker  melons    grain   apples       1500    [melons, grain, apples]
2     Wheeler  melons    grain   apples       3000    [melons, grain, apples]
3  Hood River  apples   melons    grain       1500    [apples, melons, grain]
``````

make it a sorted tuple

``````>>> df1['sorted'] = df1.apply(lambda x : tuple(sorted(x['combined_temp'])),axis=1)
County   Crop1    Crop2            ...             Total_pop              combined_temp                     sorted
0      Harney   grain   melons            ...                  2000    [grain, melons, apples]    (apples, grain, melons)
1       Baker  melons    grain            ...                  1500    [melons, grain, apples]    (apples, grain, melons)
2     Wheeler  melons    grain            ...                  3000    [melons, grain, apples]    (apples, grain, melons)
3  Hood River  apples   melons            ...                  1500    [apples, melons, grain]    (apples, grain, melons)
4       Wasco   pears  carrots            ...                  2000  [pears, carrots, raddish]  (carrots, pears, raddish)
``````

then proceed to your normal group by operation

``````>>> df1_grouped = df1.groupby(['sorted'])['Total_pop'].sum().reset_index()
>>> df1_grouped
sorted  Total_pop
0    (apples, grain, melons)       8000
``````

Method 2:
A shorted version based on the answer by aws-apprentice

``````df = df1.copy()

grouping_cols = ['Crop1', 'Crop2', 'Crop3']

df[grouping_cols] = pd.DataFrame(df.loc[:, grouping_cols]
.apply(set, axis=1)
.apply(sorted)
.values
.tolist(), columns=grouping_cols)

County    Crop1  Crop2    Crop3  Total_pop
0      Harney   apples  grain   melons       2000
1       Baker   apples  grain   melons       1500
2     Wheeler   apples  grain   melons       3000
3  Hood River   apples  grain   melons       1500
4       Wasco  carrots  pears  raddish       2000
``````

now take group by group by

``````>>> df.groupby(grouping_cols).Total_pop.sum()
Crop1    Crop2  Crop3
apples   grain  melons     8000
Name: Total_pop, dtype: int64
``````

but i personally prefer this answer using numpy

Since your data seem to guarantee 3 unique crops per country (“I am compiling a table of top-3 crops by county.”), it suffices to sort the values and assign back.

``````import numpy as np

cols = ['Crop1', 'Crop2', 'Crop3']
df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)

County    Crop1  Crop2    Crop3  Total_pop
0      Harney   apples  grain   melons       2000
1       Baker   apples  grain   melons       1500
2     Wheeler   apples  grain   melons       3000
3  Hood River   apples  grain   melons       1500
4       Wasco  carrots  pears  raddish       2000
5      Morrow  carrots  pears  raddish       2500
6       Union  carrots  pears  raddish       2700
7        Lake  carrots  pears  raddish       2000
``````

Then to summarize:

``````df1.groupby(cols).sum()

#                       Total_pop
#Crop1   Crop2 Crop3
#apples  grain melons        8000
``````

The benefit is that you avoid `Series.apply` or `.apply(axis=1)`. For larger `DataFrames`, the performance difference is noticeable:

``````df1 = pd.concat([df1]*10000, ignore_index=True)

cols = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[cols] = np.sort(df1[cols].to_numpy(), axis=1)
#36.1 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

to_sum = ['Crop1', 'Crop2', 'Crop3']
%timeit df1[to_sum] = pd.DataFrame(df1.loc[:, to_sum].apply(set, axis=1).apply(list).values.tolist(), columns=to_sum)
#1.41 s ± 51.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
``````

Here is one way to do it.

First let’s get the unique values across the columns and then reassign these values back to the DataFrame. We will perform this on a copy of the original data since you might need to preserve the original data.

``````df = df1.copy()

to_sum = ['Crop1', 'Crop2', 'Crop3']

df[to_sum] = pd.DataFrame(df.loc[:, to_sum]
.apply(set, axis=1)
.apply(sorted)
.values
.tolist(), columns=to_sum)

print(df)

County  Crop1    Crop2    Crop3  Total_pop
0      Harney  grain   apples   melons       2000
1       Baker  grain   apples   melons       1500
2     Wheeler  grain   apples   melons       3000
3  Hood River  grain   apples   melons       1500
4       Wasco  pears  carrots  raddish       2000
5      Morrow  pears  carrots  raddish       2500
6       Union  pears  carrots  raddish       2700
7        Lake  pears  carrots  raddish       2000
``````

Now we can perform our `groupby` to get the desired results.

``````df.groupby(to_sum).Total_pop.sum()

Crop1    Crop2  Crop3
apples   grain  melons     8000
Name: Total_pop, dtype: int64
``````

### `np.bincount`

``````i, u = pd.factorize([*map(frozenset, zip(df1.Crop1, df1.Crop2, df1.Crop3))])
s = np.bincount(i, df1.Total_pop)

pd.Series(s, u)

(melons, grain, apples)      8000.0
dtype: float64
``````

Or, if you want separate columns

``````pd.Series(dict(zip(map(tuple, u), s)))

melons   grain    apples    8000.0
dtype: float64
``````

And fully pretty

``````pd.Series(dict(zip(map(tuple, u), s)))
.rename_axis(['Crop1', 'Crop2', 'Crop3']).reset_index(name='Total_pop')

Crop1    Crop2   Crop3  Total_pop
0   melons    grain  apples     8000.0
``````

``````import pandas as pd

df = pd.DataFrame( {
"County" : ["Harney", "Baker", "Wheeler", "Hood River", "Wasco" , "Morrow","Union","Lake"] ,
"Crop1" : ["grain", "melons", "melons", "apples", "pears", "raddish","pears","pears"],
"Crop2" : ["melons","grain","grain","melons","carrots","pears","carrots","carrots"],
"Total_pop": [2000,1500,3000,1500,2000,2500,2700,2000]} )
print(df)
df["Merged"] = df[["Crop1", "Crop2", "Crop3"]].apply(lambda x: ','.join(x.dropna().astype(str).values), axis=1).str.split(",")
df["Merged"] = df["Merged"].sort_values().apply(lambda x: sorted(x)).apply(lambda x: ",".join(x))
df[["x", "y", "z"]] = df["Merged"].str.split(",", expand=True)
df1=df.groupby(['x',"y","z"])['Total_pop'].sum().reset_index()
print(df1)
``````

Output:

``````      County    Crop1    Crop2    Crop3  Total_pop
Harney    grain   melons   apples       2000
Baker   melons    grain   apples       1500
Wheeler   melons    grain   apples       3000
Hood River   apples   melons    grain       1500