pandas groupby and join lists

Posted on

Question :

pandas groupby and join lists

I have a dataframe df, with two columns, I want to groupby one column and join the lists belongs to same group, example:

column_a, column_b
1,         [1,2,3]
1,         [2,5]
2,         [5,6]

after the process:

column_a, column_b
1,         [1,2,3,2,5]
2,         [5,6]

I want to keep all the duplicates. I have the following questions:

  • The dtypes of the dataframe are object(s). convert_objects() doesn’t convert column_b to list automatically. How can I do this?
  • what does the function in df.groupby(…).apply(lambda x: …) apply to ? what is the form of x ? list?
  • the solution to my main problem?

Thanks in advance.

Answer #1:

object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta. So it is storing them as a list. convert_objects tries to convert a column to one of those dtypes.

You want

In [63]: df
Out[63]: 
   a          b    c
0  1  [1, 2, 3]  foo
1  1     [2, 5]  bar
2  2     [5, 6]  baz


In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]: 
         c                b
a                          
1  foo bar  [1, 2, 3, 2, 5]
2      baz           [5, 6]

This groups the data frame by the values in column a. Read more about [groupby].(http://pandas.pydata.org/pandas-docs/stable/groupby.html).

This is doing a regular list sum (concatenation) just like [1, 2, 3] + [2, 5]

Answered By: TomAugspurger

Answer #2:

df.groupby('column_a').agg(sum)

This works because of operator overloading sum concatenates the lists together. The index of the resulting df will be the values from column_a:

Answered By: qwwqwwq

Answer #3:

Use numpy and simple “for” or “map”:

import numpy as np

u_clm = np.unique(df.column_a.values)
all_lists = []

for clm in u_clm:
    df_process = df.query('column_a == @clm')
    list_ = np.concatenate(df.column_b.values)
    all_lists.append((clm, list_.tolist()))

df_sum_lists = pd.DataFrame(all_lists)

It’s faster in 350 times than a simple “groupby-agg-sum” approach for huge datasets.

Answered By: ooolllooo

Leave a Reply

Your email address will not be published. Required fields are marked *