# Efficient outer product in python

Posted on

### Question :

Efficient outer product in python

Outer product in python seems quite slow when we have to deal with vectors of dimension of order 10k. Could someone please give me some idea how could I speed up this operation in python?

Code is as follows:

`````` In : a.shape
Out: (128,)

In : b.shape
Out: (32000,)

In : %timeit np.outer(b,a)
100 loops, best of 3: 15.4 ms per loop
``````

Since I have to do this operation several times, my code is getting slower.

It doesn’t really get any faster than that, these are your options:

numpy.outer

``````>>> %timeit np.outer(a,b)
100 loops, best of 3: 9.79 ms per loop
``````

numpy.einsum

``````>>> %timeit np.einsum('i,j->ij', a, b)
100 loops, best of 3: 16.6 ms per loop
``````

numba

``````from numba.decorators import autojit

@autojit
def outer_numba(a, b):
m = a.shape
n = b.shape
result = np.empty((m, n), dtype=np.float)
for i in range(m):
for j in range(n):
result[i, j] = a[i]*b[j]
return result

>>> %timeit outer_numba(a,b)
100 loops, best of 3: 9.77 ms per loop
``````

parakeet

``````from parakeet import jit

@jit
def outer_parakeet(a, b):
... same as numba

>>> %timeit outer_parakeet(a, b)
100 loops, best of 3: 11.6 ms per loop
``````

cython

``````cimport numpy as np
import numpy as np
cimport cython
ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def outer_cython(np.ndarray[DTYPE_t, ndim=1] a, np.ndarray[DTYPE_t, ndim=1] b):
cdef int m = a.shape
cdef int n = b.shape
cdef np.ndarray[DTYPE_t, ndim=2] result = np.empty((m, n), dtype=np.float64)
for i in range(m):
for j in range(n):
result[i, j] = a[i]*b[j]
return result

>>> %timeit outer_cython(a, b)
100 loops, best of 3: 10.1 ms per loop
``````

theano

``````from theano import tensor as T
from theano import function

x = T.vector()
y = T.vector()

outer_theano = function([x, y], T.outer(x, y))

>>> %timeit outer_theano(a, b)
100 loops, best of 3: 17.4 ms per loop
``````

pypy

``````# Same code as the `outer_numba` function
>>> timeit.timeit("outer_pypy(a,b)", number=100, setup="import numpy as np;a = np.random.rand(128,);b = np.random.rand(32000,);from test import outer_pypy;outer_pypy(a,b)")*1000 / 100.0
16.36 # ms
``````

## Conclusions:

``````???????????????????????????????????
?  method   ? time(ms)* ? version ?
???????????????????????????????????
? numba     ? 9.77      ? 0.16.0  ?
? np.outer  ? 9.79      ? 1.9.1   ?
? cython    ? 10.1      ? 0.21.2  ?
? parakeet  ? 11.6      ? 0.23.2  ?
? pypy      ? 16.36     ? 2.4.0   ?
? np.einsum ? 16.6      ? 1.9.1   ?
? theano    ? 17.4      ? 0.6.0   ?
???????????????????????????????????
* less time = faster
``````

@elyase’s answer is great, and rightly accepted. Here’s an additional suggestion that, if you can use it, might make the call to `np.outer` even faster.

You say “I have to do this operation several times”, so it is possible that you can reuse the array that holds the outer product, instead of allocating a new one each time. That can give a nice boost in performance.

First, some random data to work with:

``````In : a = np.random.randn(128)

In : b = np.random.randn(32000)
``````

Here’s the baseline timing for np.outer(a, b) on my computer:

``````In : %timeit np.outer(a, b)
100 loops, best of 3: 5.52 ms per loop
``````

Suppose we’re going to repeat that operation several times, with arrays of the same shape. Create an `out` array to hold the result:

``````In : out = np.empty((128, 32000))
``````

Now use `out` as the third argument of `np.outer`:

``````In : %timeit np.outer(a, b, out)
100 loops, best of 3: 2.38 ms per loop
``````

So you get a nice performance boost if you can reuse the array that holds the outer product.

You get a similar benefit if you use the `out` argument of `einsum`, and in the cython function if you add a third argument for the output instead of allocating it in the function with `np.empty`. (The other compiled/jitted codes in @elyase’s answer will probably benefit from this, too, but I only tried the cython version.)

Nota bene! The benefit shown above might not be realized in practice. The `out` array fits in the L3 cache of my CPU, and when it is used in the loop performed by the `timeit` command, it likely remains in the cache. In practice, the array might be moved out of the cache between calls to `np.outer`. In that case, the improvement isn’t so dramatic, but it should still be at least the cost of a call to `np.empty()`, i.e.

``````In : %timeit np.empty((128, 32000))
1000 loops, best of 3: 1.29 ms per loop
``````

It should be as simple as using `numpy.outer()`: a single function call which will be implemented in C for high performance.