# Is there any numpy group by function?

### Question :

Is there any function in numpy to group the array below by its first column?

I couldn’t find a good answer on the internet.

``````>>> a
array([[  1, 275],
[  1, 441],
[  1, 494],
[  1, 593],
[  2, 679],
[  2, 533],
[  2, 686],
[  3, 559],
[  3, 219],
[  3, 455],
[  4, 605],
[  4, 468],
[  4, 692],
[  4, 613]])
``````

Wanted output:

``````array([[[275, 441, 494, 593]],
[[679, 533, 686]],
[[559, 219, 455]],
[[605, 468, 692, 613]]], dtype=object)
``````

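Inspired by Eelco Hoogendoorn’s library, but without using it, this relies on the first column of your array being sorted in increasing order. If it is not, sort the rows by that column first, e.g. `a = a[a[:, 0].argsort(kind="stable")]`. Note that an in-place `a.sort(axis=0)` sorts each column independently and would break the key/value pairing.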

``````>>> np.split(a[:,1], np.unique(a[:, 0], return_index=True)[1][1:])
[array([275, 441, 494, 593]),
array([679, 533, 686]),
array([559, 219, 455]),
array([605, 468, 692, 613])]
``````
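If the keys are not already sorted, the same `np.split` idea can be wrapped in a small helper that sorts first. The following is a minimal sketch rather than a definitive implementation; the function name `group_values_by_key` and the shortened sample array are made up for illustration.

```python
import numpy as np

def group_values_by_key(keys, values):
    """Group `values` by `keys` without assuming the keys are pre-sorted."""
    keys = np.asarray(keys)
    values = np.asarray(values)
    # A stable sort keeps the original order of the values within each group.
    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    sorted_values = values[order]
    # return_index gives the position where each unique key first appears.
    unique_keys, first_idx = np.unique(sorted_keys, return_index=True)
    # Drop the leading 0 so np.split does not emit an empty first chunk.
    return unique_keys, np.split(sorted_values, first_idx[1:])

# Illustrative input with interleaved keys.
a = np.array([[2, 679], [1, 275], [2, 533], [1, 441], [3, 559]])
unique_keys, groups = group_values_by_key(a[:, 0], a[:, 1])
```

Here `unique_keys` holds the distinct keys in increasing order and `groups[i]` holds the values that belong to `unique_keys[i]`.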

I didn’t time it, but this is probably one of the faster ways to answer the question:

• No native Python loop
• The results are numpy arrays, so no conversion is needed if you want to run further numpy operations on them
• Complexity is dominated by `np.unique`, which sorts its input, so expect roughly O(n log n) rather than O(n)

[EDIT] I improved the answer thanks to ns63sr.

The numpy_indexed package (disclaimer: I am its author) aims to fill this gap in numpy. All operations in numpy-indexed are fully vectorized, and no O(n^2) algorithms were harmed during the making of this library.

``````import numpy_indexed as npi
npi.group_by(a[:, 0]).split(a[:, 1])
``````

Note that it is usually more efficient to compute the relevant properties directly over such groups (i.e., `group_by(keys).mean(values)`) rather than first splitting into a list / jagged array.
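The same "aggregate directly instead of splitting" idea can also be sketched in plain NumPy, without numpy_indexed. This is not how numpy_indexed is implemented; it is just one way to get per-group means using `np.unique(..., return_inverse=True)` and `np.bincount`, with variable names chosen for illustration.

```python
import numpy as np

a = np.array([[1, 275], [1, 441], [1, 494], [1, 593],
              [2, 679], [2, 533], [2, 686],
              [3, 559], [3, 219], [3, 455],
              [4, 605], [4, 468], [4, 692], [4, 613]])

# return_inverse maps every row to the index of its key in unique_keys.
unique_keys, inverse = np.unique(a[:, 0], return_inverse=True)
# Sum the values per group and count the rows per group, then divide.
group_sums = np.bincount(inverse, weights=a[:, 1])
group_counts = np.bincount(inverse)
group_means = group_sums / group_counts
```

No jagged intermediate is built here: `group_means[i]` is the mean of the values whose key is `unique_keys[i]`.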

Numpy is not very handy here because the desired output is not an array of integers (it is an array of list objects).

I suggest either the pure Python way…

``````from collections import defaultdict

%%timeit
d = defaultdict(list)
for key, val in a:
    d[key].append(val)
10.7 µs ± 156 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# result:
defaultdict(list,
{1: [275, 441, 494, 593],
2: [679, 533, 686],
3: [559, 219, 455],
4: [605, 468, 692, 613]})
``````

…or the pandas way:

``````import pandas as pd

%%timeit
df = pd.DataFrame(a, columns=["key", "val"])
df.groupby("key").val.apply(pd.Series.tolist)
979 µs ± 3.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# result:
key
1    [275, 441, 494, 593]
2         [679, 533, 686]
3         [559, 219, 455]
4    [605, 468, 692, 613]
Name: val, dtype: object
``````

``````n = np.unique(a[:,0])
np.array([list(a[a[:, 0] == i, 1]) for i in n], dtype=object)
``````

outputs:

``````array([[275, 441, 494, 593], [679, 533, 686], [559, 219, 455],
[605, 468, 692, 613]], dtype=object)
``````

Simplifying the answer of Vincent J and considering the comment of HS-nebula one can use `return_index = True` instead of `return_counts = True` and get rid of the `cumsum`:

``````np.split(a[:,1], np.unique(a[:,0], return_index = True)[1])[1:]
``````

Output

``````[array([275, 441, 494, 593]),
array([679, 533, 686]),
array([559, 219, 455]),
array([605, 468, 692, 613])]
``````

I used np.unique() followed by np.extract()

``````unique = np.unique(a[:, 0])
answer = []
for element in unique:
    present = a[:, 0] == element
    answer.append(np.extract(present, a[:, -1]))
print(answer)
``````

`[array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]`

Given `X` as an array of the items you want to group and `y` (a 1D array) as the corresponding group labels, the following function does the grouping with numpy:

``````def groupby(X, y):
    y = np.asarray(y)
    X = np.asarray(X)
    y_uniques = np.unique(y)
    return [X[y==yi] for yi in y_uniques]
``````

So, `groupby(a[:,1], a[:,0])` returns
`[array([275, 441, 494, 593]), array([679, 533, 686]), array([559, 219, 455]), array([605, 468, 692, 613])]`
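Because the boolean mask indexes the first axis, the same function should also work when `X` is 2-D, in which case each group keeps whole rows. A small sketch with a shortened sample array (the name `row_groups` is illustrative):

```python
import numpy as np

def groupby(X, y):
    y = np.asarray(y)
    X = np.asarray(X)
    y_uniques = np.unique(y)
    # One boolean mask per unique label; each mask selects along axis 0.
    return [X[y == yi] for yi in y_uniques]

a = np.array([[1, 275], [1, 441], [2, 679], [3, 559], [3, 219]])
# Group full rows (key and value together) by the first column.
row_groups = groupby(a, a[:, 0])
```

Note that this builds one mask per unique key, so the work grows with the number of rows times the number of groups.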

We might also find it useful to generate a `dict`:

``````def groupby(X):
    X = np.asarray(X)
    x_uniques = np.unique(X)
    return {xi:X[X==xi] for xi in x_uniques}
``````

Let’s try it out:

``````X=[1,1,2,2,3,3,3,3,4,5,6,7,7,8,9,9,1,1,1]
groupby(X)
Out:
{1: array([1, 1, 1, 1, 1]),
2: array([2, 2]),
3: array([3, 3, 3, 3]),
4: array([4]),
5: array([5]),
6: array([6]),
7: array([7, 7]),
8: array([8]),
9: array([9, 9])}
``````

Note this by itself is not super compelling – but if we make `X` an `object` or `namedtuple` and then provide a `groupby` function it becomes more interesting. Will put that in later.
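One possible reading of that idea, sketched here with plain Python rather than numpy, is to group arbitrary records by a caller-supplied key function. The `Reading` type and the `groupby_key` helper are hypothetical names used only for illustration; they are not from the answer above.

```python
from collections import defaultdict, namedtuple

# Hypothetical record type, used only for illustration.
Reading = namedtuple("Reading", ["sensor", "value"])

def groupby_key(items, key):
    """Group arbitrary items into a dict keyed by key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

readings = [Reading(1, 275), Reading(2, 679), Reading(1, 441), Reading(2, 533)]
by_sensor = groupby_key(readings, key=lambda r: r.sensor)
```

Each value in `by_sensor` is a list of the original records, in their original order, that share the same `sensor`.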