### Question :

I have a function `foo`

that takes a pointer to memory as argument and both writes and reads to that memory:

```
cdef void foo (double *data):
data[some_index_int] = some_value_double
do_something_dependent_on (data)
```

I’m allocating to `data`

like so:

```
cdef int N = some_int
cdef double *data = <double*> malloc (N * sizeof (double))
cdef int i
for i in cython.parallel.prange (N, nogil=True):
foo (data)
readout (data)
```

My question is now: how do the different threads treat this? My guess is that the memory pointed to by `data`

will be shared by all threads and ‘simultaneously’ read from or written to while inside the function `foo`

. This would then mess up all results as one can’t rely on a previously set datavalue (within `foo`

)? Is my guessing correct or is there some magic safety-belt implemented in the cython-compiler?

Thank you very much in advance.

##
Answer #1:

I assume that without read or write synchronization locks to `data`

threads will read/write to the memory location and overwrite each other’s changes. You will not get consistent results without some kind of synchronization.

Although the docs (http://docs.cython.org/src/userguide/parallelism.html) seem to suggest that OpenMP (*the default backend*) automatically creates thread locals.

##
Answer #2:

A good way is to have the main array beeing alocated ouside the threads. Then you give to each thread the pointer to the part of the main array that should be computed by the thread.

The following example is an implementation of a matrix multiplication (similar to `dot`

for 2-D arrays) where:

```
c = a*b
```

The parallelism here is implemented over the rows of `a`

. Check how the pointers are passed to the `multiply`

function in order allow the different threads to share the same arrays.

```
import numpy as np
cimport numpy as np
import cython
from cython.parallel import prange
ctypedef np.double_t cDOUBLE
DOUBLE = np.float64
def mydot(np.ndarray[cDOUBLE, ndim=2] a, np.ndarray[cDOUBLE, ndim=2] b):
cdef np.ndarray[cDOUBLE, ndim=2] c
cdef int i, M, N, K
c = np.zeros((a.shape[0], b.shape[1]), dtype=DOUBLE)
M = a.shape[0]
N = a.shape[1]
K = b.shape[1]
for i in prange(M, nogil=True):
multiply(&a[i,0], &b[0,0], &c[i,0], N, K)
return c
@cython.wraparound(False)
@cython.boundscheck(False)
@cython.nonecheck(False)
cdef void multiply(double *a, double *b, double *c, int N, int K) nogil:
cdef int j, k
for j in range(N):
for k in range(K):
c[k] += a[j]*b[k+j*K]
```

To check you can use this script:

```
import time
import numpy as np
import _stack
a = np.random.random((10000,500))
b = np.random.random((500,2000))
t = time.clock()
c = np.dot(a, b)
print('finished dot: {} s'.format(time.clock()-t))
t = time.clock()
c2 = _stack.mydot(a, b)
print('finished mydot: {} s'.format(time.clock()-t))
print 'Passed test:', np.allclose(c, c2)
```

Where on my computer it gives:

```
finished dot: 0.601547366526 s
finished mydot: 2.834147917 s
Passed test: True
```

If the number of rows of `a`

was smaller than then number of cols or the number of cols in `b`

the `mydot`

would be worse, requiring a better check on which dimension to make the parallelism.