Python Pandas How to assign groupby operation results back to columns in parent dataframe?

Posted on

Question :

Python Pandas How to assign groupby operation results back to columns in parent dataframe?

I have the following data frame in IPython, where each row is a single stock:

In [261]: bdata
Out[261]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21210 entries, 0 to 21209
Data columns:
BloombergTicker      21206  non-null values
Company              21210  non-null values
Country              21210  non-null values
MarketCap            21210  non-null values
PriceReturn          21210  non-null values
SEDOL                21210  non-null values
yearmonth            21210  non-null values
dtypes: float64(2), int64(1), object(4)

I want to apply a groupby operation that computes cap-weighted average return across everything, per each date in the “yearmonth” column.

This works as expected:

In [262]: bdata.groupby("yearmonth").apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
Out[262]:
yearmonth
201204      -0.109444
201205      -0.290546

But then I want to sort of “broadcast” these values back to the indices in the original data frame, and save them as constant columns where the dates match.

In [263]: dateGrps = bdata.groupby("yearmonth")

In [264]: dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/mnt/bos-devrnd04/usr6/home/espears/ws/Research/Projects/python-util/src/util/<ipython-input-264-4a68c8782426> in <module>()
----> 1 dateGrps["MarketReturn"] = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())

TypeError: 'DataFrameGroupBy' object does not support item assignment

I realize this naive assignment should not work. But what is the “right” Pandas idiom for assigning the result of a groupby operation into a new column on the parent dataframe?

In the end, I want a column called “MarketReturn” than will be a repeated constant value for all indices that have matching date with the output of the groupby operation.

One hack to achieve this would be the following:

marketRetsByDate  = dateGrps.apply(lambda x: (x["PriceReturn"]*x["MarketCap"]/x["MarketCap"].sum()).sum())

bdata["MarketReturn"] = np.repeat(np.NaN, len(bdata))

for elem in marketRetsByDate.index.values:
    bdata["MarketReturn"][bdata["yearmonth"]==elem] = marketRetsByDate.ix[elem]

But this is slow, bad, and unPythonic.

Asked By: ely

||

Answer #1:

In [97]: df = pandas.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})

In [98]: df.join(df.groupby('month')['A'].sum(), on='month', rsuffix='_r')
Out[98]:
           A         B  month       A_r
0  -0.040710  0.182269      0 -0.331816
1  -0.004867  0.642243      1  2.448232
2  -0.162191  0.442338      4  2.045909
3  -0.979875  1.367018      5 -2.736399
4  -1.126198  0.338946      5 -2.736399
5  -0.992209 -1.343258      1  2.448232
6  -1.450310  0.021290      0 -0.331816
7  -0.675345 -1.359915      9  2.722156
Answered By: ely

Answer #2:

While I’m still exploring all of the incredibly smart ways that apply concatenates the pieces it’s given, here’s another way to add a new column in the parent after a groupby operation.

In [236]: df
Out[236]: 
  yearmonth    return
0    201202  0.922132
1    201202  0.220270
2    201202  0.228856
3    201203  0.277170
4    201203  0.747347

In [237]: def add_mkt_return(grp):
   .....:     grp['mkt_return'] = grp['return'].sum()
   .....:     return grp
   .....: 

In [238]: df.groupby('yearmonth').apply(add_mkt_return)
Out[238]: 
  yearmonth    return  mkt_return
0    201202  0.922132    1.371258
1    201202  0.220270    1.371258
2    201202  0.228856    1.371258
3    201203  0.277170    1.024516
4    201203  0.747347    1.024516
Answered By: Wouter Overmeire

Answer #3:

As a general rule when using groupby(), if you use the .transform() function pandas will return a table with the same length as your original. When you use other functions like .sum() or .first() then pandas will return a table where each row is a group.

I’m not sure how this works with apply but implementing elaborate lambda functions with transform can be fairly tricky so the strategy that I find most helpful is to create the variables I need, place them in the original dataset and then do my operations there.

If I understand what you’re trying to do correctly first you can calculate the total market cap for each group:

bdata['group_MarketCap'] = bdata.groupby('yearmonth')['MarketCap'].transform('sum')

This will add a column called “group_MarketCap” to your original data which would contain the sum of market caps for each group. Then you can calculate the weighted values directly:

bdata['weighted_P'] = bdata['PriceReturn'] * (bdata['MarketCap']/bdata['group_MarketCap'])

And finally you would calculate the weighted average for each group using the same transform function:

bdata['MarketReturn'] = bdata.groupby('yearmonth')['weighted_P'].transform('sum')

I tend to build my variables this way. Sometimes you can pull off putting it all in a single command but that doesn’t always work with groupby() because most of the time pandas needs to instantiate the new object to operate on it at the full dataset scale (i.e. you can’t add two columns together if one doesn’t exist yet).

Hope this helps 🙂

Answered By: Garrett

Answer #4:

May I suggest the transform method (instead of aggregate)? If you use it in your original example it should do what you want (the broadcasting).

Answered By: seeiespi

Answer #5:

I did not find a way to make assignment to the original dataframe. So I just store the results from the groups and concatenate them. Then we sort the concatenated dataframe by index to get the original order as the input dataframe. Here is a sample code:

In [10]: df = pd.DataFrame({'month': np.random.randint(0,11, 100), 'A': np.random.randn(100), 'B': np.random.randn(100)})

In [11]: df.head()
Out[11]:
   month         A         B
0      4 -0.029106 -0.904648
1      2 -2.724073  0.492751
2      7  0.732403  0.689530
3      2  0.487685 -1.017337
4      1  1.160858 -0.025232

In [12]: res = []

In [13]: for month, group in df.groupby('month'):
    ...:     new_df = pd.DataFrame({
    ...:         'A^2+B': group.A ** 2 + group.B,
    ...:         'A+B^2': group.A + group.B**2
    ...:     })
    ...:     res.append(new_df)
    ...:

In [14]: res = pd.concat(res).sort_index()

In [15]: res.head()
Out[15]:
      A^2+B     A+B^2
0 -0.903801  0.789282
1  7.913327 -2.481270
2  1.225944  1.207855
3 -0.779501  1.522660
4  1.322360  1.161495

This method is pretty fast and extensible. You can derive any feature here.

Answered By: Wes McKinney

Leave a Reply

Your email address will not be published. Required fields are marked *