Question :
I just started using Numpy and noticed that iterating through each element in a Numpy array is ~4x slower than doing the same but with a list of lists. I know now that this defeats the purpose of Numpy and I should vectorize the function if possible. My question is though why is it 4x slower. That seems like quite a large amount.
I ran the tests below using %timeit
import numpy as np
b = np.eye(1000)
a = b.tolist()
%timeit b[100][100] #1000000 loops, best of 3: 692 ns per loop
%timeit a[100][100] #10000000 loops, best of 3: 70.7 ns per loop
%timeit b[100,100] #1000000 loops, best of 3: 343 ns per loop
%timeit b.item(100,100) #1000000 loops, best of 3: 297 ns per loop
I tried to use dis.dis
to see what was going on under the hood but got:
TypeError: don't know how to disassemble methodwrapper objects
Then I tried to look at the Numpy source code but couldn’t figure out which file corresponded to array element access. I’m curious what accounts for the extra overhead, and more importantly how to figure this out for myself in the future. It seems like python can’t be easily compiled to C code so that I can see the difference. But is there a way to see what byte code is generated for each line, to get a sense of the differences?
Answer #1:
In summary: getting an item from a NumPy array requires new Python objects to be created, whereas this is not the case for lists. Also, indexing is more slightly more complicated for NumPy arrays than lists which may add some additional overhead.
To recap, the NumPy operations you have listed do the following:
b[100][100]
returns row 100 ofb
as an array, and then gets the value at index 100 of this row, returning the value as an object (e.g. anp.int64
type).b[100,100]
returns the value at row 100 and column 100 directly (no intermediate array is returned first).b.item(100,100)
does the same as aboveb[100,100]
except that the value is converted to a native Python type and returned.
Now of these operation, (1) is slowest because it requires two sequential NumPy indexing operations (I’ll explain why this is slower than list indexing below). (2) is quickest because only a single indexing operation is performed. Operation (3) is possibly slower as it is a method call (these are generally slow in Python).
Why is list access still faster than b[100,100]
?
Object creation
Python lists are arrays of pointers to objects in memory. For example, the list [1, 2, 3]
does not contain those integers directly, but rather pointers to the memory addresses were each integer object already exists. To get an item from the list, Python just returns a reference to the object.
NumPy arrays are not collections of objects. The array np.array([1, 2, 3])
is just a contiguous block of memory with bits set to represent those integer values. To get an integer from this array, a new Python object must be constructed in memory separate to the array. For instance, an object of np.int64
may be returned by the indexing operation: this object did not exist previously and had to be created.
Indexing complexity
Two other reasons why a[100][100]
(getting from the list) is quicker than b[100,100]
(getting from the array) are that:

The bytecode opcode
BINARY_SUBSCR
is executed when indexing both lists and arrays, but it is optimised for the case of Python lists. 
The internal C function handling integer indexing for Python lists is very short and simple. On the other hand, NumPy indexing is much more complicated and a significant amount of code is executed to determine the type of indexing being used so that the correct value can be returned.
Below, the steps for accessing elements in a list and array with a[100][100]
and b[100,100]
are described in more detail.
Bytecode
The same four bytecode opcodes are triggered for both lists and arrays:
0 LOAD_NAME 0 (a) # the list or array
3 LOAD_CONST 0 (100) # index number (tuple for b[100,100])
6 BINARY_SUBSCR # find correct "getitem" function
7 RETURN_VALUE # value returned from list or array
Note: if you start chain indexing for multidimensional lists, e.g. a[100][100][100]
, you start to repeat these bytecode instructions. This does not happen for NumPy arrays using the tuple indexing: b[100,100,100]
uses just the four instructions. This is why the gap in the timings begins to close as the number of dimensions increases.
Finding the correct “getitem” function
The functions for accessing lists and arrays are different and the correct one needs to be found in each case. This task is handled by the BINARY_SUBSCR
opcode:
w = POP(); // our index
v = TOP(); // our list or NumPy array
if (PyList_CheckExact(v) && PyInt_CheckExact(w)) { // do we have a list and an int?
/* INLINE: list[int] */
Py_ssize_t i = PyInt_AsSsize_t(w);
if (i < 0)
i += PyList_GET_SIZE(v);
if (i >= 0 && i < PyList_GET_SIZE(v)) {
x = PyList_GET_ITEM(v, i); // call "getitem" for lists
Py_INCREF(x);
}
else
goto slow_get;
}
else
slow_get:
x = PyObject_GetItem(v, w); // else, call another function
// to work out what is needed
Py_DECREF(v);
Py_DECREF(w);
SET_TOP(x);
if (x != NULL) continue;
break;
This code is optimised for Python lists. If the function sees a list, it will quickly call the function PyList_GET_ITEM
. This list can now be accessed at the required index (see next section below).
However, if it doesn’t see a list (e.g. we have a NumPy array), it takes the “slow_get” path. This in turn calls another function PyObject_GetItem
to check which “getitem” function the object is mapped to:
PyObject_GetItem(PyObject *o, PyObject *key)
{
PyMappingMethods *m;
if (o == NULL  key == NULL)
return null_error();
m = o>ob_type>tp_as_mapping;
if (m && m>mp_subscript)
return m>mp_subscript(o, key);
...
In the case of NumPy arrays, the correct function is located in mp_subscript
in the PyMappingMethods
structure.
Notice the additional function calls before this correct “get” function can be called. These calls add to the overhead for b[100]
, although how much will depend on how Python/NumPy was compiled, the system architecture, and so on.
Getting from a Python list
Above it was seen that the function PyList_GET_ITEM
is called. This is a short function that essentially looks like this*:
PyList_GetItem(PyObject *op, Py_ssize_t i)
{
if (!PyList_Check(op)) { // check if list
PyErr_BadInternalCall();
return NULL;
}
if (i < 0  i >= Py_SIZE(op)) { // check i is in range
if (indexerr == NULL) {
indexerr = PyUnicode_FromString(
"list index out of range");
if (indexerr == NULL)
return NULL;
}
PyErr_SetObject(PyExc_IndexError, indexerr);
return NULL;
}
return ((PyListObject *)op) > ob_item[i]; // return reference to object
}
* PyList_GET_ITEM
is actually the macro form of this function which does the same thing, minus error checking.
This means that getting the item at index i
of a Python list is relatively simple. Internally, Python checks whether the type of the item being is a list, whether i
is in the correct range for the list, and then returns the reference to the object in the list.
Getting from a NumPy array
In contrast, NumPy has to do much more work before the value at the requested index can be returned.
Arrays can be indexed in a variety of different ways and NumPy has to decide which index routine is needed. The various indexing routines are handled largely by code in mapping.c
.
Anything used to index NumPy arrays passes through the function prepare_index
which begins the parsing of the index and stores the information about broadcasting, number of dimensions, and so on. Here is the call signature for the function:
NPY_NO_EXPORT int
prepare_index(PyArrayObject *self, PyObject *index,
npy_index_info *indices,
int *num, int *ndim, int *out_fancy_ndim, int allow_boolean)
/* @param the array being indexed
* @param the index object
* @param index info struct being filled (size of NPY_MAXDIMS * 2 + 1)
* @param number of indices found
* @param dimension of the indexing result
* @param dimension of the fancy/advanced indices part
* @param whether to allow the boolean special case
*/
The function has to do a lot of checks. Even for a relatively simple index such as b[100,100]
, a lot of information has to be inferred so that NumPy can return a reference (view) to the correct value.
In conclusion, it takes longer for the “getitem” function for NumPy to be found and the functions handling the indexing of arrays are necessarily more complex than the single function for Python lists.
Answer #2:
When numpy returns the item from one position in the array, it must convert the internal C type (float, double, etc.) value into a Pythontyped scalar value (int, long, float). It then returns the reference to the Pythontyped value. This conversion takes some time.
Interestingly enough, the same inefficiency hurts performance in another way, too. I had a Python list that I was indexing into using index values that came from a numpy array. That same conversion occurs to create the Python integer value that is needed to index into a Python list. I had to rewrite my algorithm with an intermediate array of native Python integers.