### Question:

The outer product in Python seems quite slow when dealing with vectors of dimension on the order of 10k. Could someone give me some idea of how I could speed up this operation in Python?

Code is as follows:

```
In [8]: a.shape
Out[8]: (128,)
In [9]: b.shape
Out[9]: (32000,)
In [10]: %timeit np.outer(b,a)
100 loops, best of 3: 15.4 ms per loop
```

Since I have to perform this operation several times, it is slowing down my code.
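For anyone wanting to reproduce the setup, here is a minimal sketch using the array shapes from the question (exact timings will of course vary by machine):

```python
import timeit

import numpy as np

# Vectors with the shapes from the question: (128,) and (32000,)
a = np.random.rand(128)
b = np.random.rand(32000)

# Time 100 calls to np.outer and report the mean per-call time in ms
t = timeit.timeit(lambda: np.outer(b, a), number=100) / 100 * 1000
print(f"np.outer: {t:.2f} ms per call, result shape {np.outer(b, a).shape}")
```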

## Answer #1:

It doesn’t really get any faster than that; these are your options:

**numpy.outer**

```
>>> %timeit np.outer(a,b)
100 loops, best of 3: 9.79 ms per loop
```

**numpy.einsum**

```
>>> %timeit np.einsum('i,j->ij', a, b)
100 loops, best of 3: 16.6 ms per loop
```

**numba**

```
import numpy as np
from numba import njit  # note: `autojit` (from numba.decorators) was removed in later Numba releases

@njit
def outer_numba(a, b):
    m = a.shape[0]
    n = b.shape[0]
    # np.float was removed from NumPy; use np.float64 explicitly
    result = np.empty((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i] * b[j]
    return result
>>> %timeit outer_numba(a,b)
100 loops, best of 3: 9.77 ms per loop
```

**parakeet**

```
from parakeet import jit

@jit
def outer_parakeet(a, b):
    ...  # same body as outer_numba above
>>> %timeit outer_parakeet(a, b)
100 loops, best of 3: 11.6 ms per loop
```

**cython**

```
cimport cython
cimport numpy as np
import numpy as np

ctypedef np.float64_t DTYPE_t

@cython.boundscheck(False)
@cython.wraparound(False)
def outer_cython(np.ndarray[DTYPE_t, ndim=1] a, np.ndarray[DTYPE_t, ndim=1] b):
    cdef int m = a.shape[0]
    cdef int n = b.shape[0]
    cdef int i, j  # typed loop indices, so the loops compile to C
    cdef np.ndarray[DTYPE_t, ndim=2] result = np.empty((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            result[i, j] = a[i] * b[j]
    return result
>>> %timeit outer_cython(a, b)
100 loops, best of 3: 10.1 ms per loop
```

**theano**

```
from theano import tensor as T
from theano import function
x = T.vector()
y = T.vector()
outer_theano = function([x, y], T.outer(x, y))
>>> %timeit outer_theano(a, b)
100 loops, best of 3: 17.4 ms per loop
```

**pypy**

```
# Same code as the `outer_numba` function, imported from test.py
>>> import timeit
>>> setup = ("import numpy as np; "
...          "a = np.random.rand(128); b = np.random.rand(32000); "
...          "from test import outer_pypy; outer_pypy(a, b)")
>>> timeit.timeit("outer_pypy(a, b)", number=100, setup=setup) * 1000 / 100.0
16.36  # ms
```

**Conclusions:**

```
+-----------+-----------+---------+
| method    | time(ms)* | version |
+-----------+-----------+---------+
| numba     |   9.77    | 0.16.0  |
| np.outer  |   9.79    | 1.9.1   |
| cython    |  10.1     | 0.21.2  |
| parakeet  |  11.6     | 0.23.2  |
| pypy      |  16.36    | 2.4.0   |
| np.einsum |  16.6     | 1.9.1   |
| theano    |  17.4     | 0.6.0   |
+-----------+-----------+---------+
* less time = faster
```

## Answer #2:

@elyase’s answer is great, and rightly accepted. Here’s an additional suggestion that, if you can use it, might make the call to `np.outer` even faster.

You say “I have to do this operation several times”, so it is possible that you can reuse the array that holds the outer product, instead of allocating a new one each time. That can give a nice boost in performance.

First, some random data to work with:

```
In [32]: a = np.random.randn(128)
In [33]: b = np.random.randn(32000)
```

Here’s the baseline timing for `np.outer(a, b)` on my computer:

```
In [34]: %timeit np.outer(a, b)
100 loops, best of 3: 5.52 ms per loop
```

Suppose we’re going to repeat that operation several times, with arrays of the same shape. Create an `out` array to hold the result:

```
In [35]: out = np.empty((128, 32000))
```

Now use `out` as the third argument of `np.outer`:

```
In [36]: %timeit np.outer(a, b, out)
100 loops, best of 3: 2.38 ms per loop
```

So you get a nice performance boost if you can reuse the array that holds the outer product.

You get a similar benefit if you use the `out` argument of `einsum`, and in the cython function if you add a third argument for the output instead of allocating it in the function with `np.empty`. (The other compiled/jitted codes in @elyase’s answer will probably benefit from this, too, but I only tried the cython version.)
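For concreteness, the `einsum` variant with a preallocated output might look like this (a sketch; the speedup depends on your machine and cache behavior):

```python
import numpy as np

a = np.random.randn(128)
b = np.random.randn(32000)

# Preallocate the output once, then reuse it on every call
out = np.empty((a.shape[0], b.shape[0]))

# einsum writes the outer product directly into `out`,
# skipping the allocation of a fresh result array
np.einsum('i,j->ij', a, b, out=out)

# Same result as allocating a new array each time
assert np.allclose(out, np.outer(a, b))
```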

*Nota bene!* The benefit shown above might not be realized in practice. The `out` array fits in the L3 cache of my CPU, and when it is used in the loop performed by the `timeit` command, it likely remains in the cache. In practice, the array might be moved out of the cache between calls to `np.outer`. In that case, the improvement isn’t so dramatic, but it should still save at least the cost of a call to `np.empty()`, i.e.

```
In [53]: %timeit np.empty((128, 32000))
1000 loops, best of 3: 1.29 ms per loop
```

## Answer #3:

It should be as simple as using `numpy.outer()`: a single function call, implemented in C for high performance.
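A minimal usage sketch, with small vectors so the result is easy to check by hand:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])

# outer(a, b)[i, j] = a[i] * b[j]; the loops run in C, not Python
result = np.outer(a, b)
print(result)
# [[ 4.  5.]
#  [ 8. 10.]
#  [12. 15.]]
```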