What are the advantages of NumPy over regular Python lists?
I have approximately 100 financial markets series, and I am going to create a cube array of 100x100x100 = 1 million cells. I will be regressing (3-variable) each x with each y and z, to fill the array with standard errors.
I have heard that for “large matrices” I should use NumPy as opposed to Python lists, for performance and scalability reasons. Thing is, I know Python lists and they seem to work for me.
What will the benefits be if I move to NumPy?
What if I had 1000 series (that is, 1 billion floating point cells in the cube)?
NumPy’s arrays are more compact than Python lists — a list of lists as you describe, in Python, would take at least 20 MB or so, while a NumPy 3D array with single-precision floats in the cells would fit in 4 MB. Access in reading and writing items is also faster with NumPy.
Maybe you don’t care that much for just a million cells, but you definitely would for a billion cells — neither approach would fit in a 32-bit architecture, but with 64-bit builds NumPy would get away with 4 GB or so, Python alone would need at least about 12 GB (lots of pointers which double in size) — a much costlier piece of hardware!
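As a rough check of the array-side figures above, the raw data size can be computed from the element size without allocating anything. This is only a sketch: `cube_bytes` is a hypothetical helper name, and it counts the data buffer alone, not the small fixed array header.

```python
import numpy as np

def cube_bytes(n_series, dtype):
    """Bytes needed for the raw data of an n x n x n array of the given dtype."""
    return n_series ** 3 * np.dtype(dtype).itemsize

for n in (100, 1000):
    for dtype in (np.float32, np.float64):
        size = cube_bytes(n, dtype)
        print(f"{n}^3 cells, {np.dtype(dtype).name}: {size / 1e6:,.0f} MB")
```

With single precision this gives 4 MB for 100 series and 4 GB for 1000 series; double precision doubles both.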
The difference is mostly due to “indirectness”: a Python list is an array of pointers to Python objects, at least 4 bytes per pointer (on a 32-bit build) plus 16 bytes for even the smallest Python object (4 for the type pointer, 4 for the reference count, 4 for the value, and the memory allocator rounds up to 16). A NumPy array is an array of uniform values: single-precision numbers take 4 bytes each, double-precision ones 8 bytes. Less flexible, but you pay substantially for the flexibility of standard Python lists!
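The per-element overhead can be measured directly. The following is a minimal sketch using `sys.getsizeof`; the exact numbers depend on the Python version and on whether the build is 32-bit or 64-bit, so treat the output as indicative only.

```python
import sys
import numpy as np

n = 100_000
values = [float(i) for i in range(n)]       # n separate Python float objects
arr = np.array(values, dtype=np.float32)    # n packed 4-byte floats

# The list stores one pointer per element; each float is its own object.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
array_bytes = arr.nbytes                    # raw data buffer only

print("list :", list_bytes / n, "bytes per element")
print("array:", array_bytes / n, "bytes per element")
```

The array side is always exactly 4 bytes per element for `float32`; the list side is several times larger because it pays for both the pointer and the object it points to.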
NumPy is not just more efficient; it is also more convenient. You get a lot of vector and matrix operations for free, which sometimes allow one to avoid unnecessary work. And they are also efficiently implemented.
For example, you could read your cube directly from a file into an array:
import numpy
x = numpy.fromfile("data", dtype=float).reshape((100, 100, 100))
Sum along the second dimension:
s = x.sum(axis=1)
Find which cells are above a threshold:
(x > 0.5).nonzero()
Keep every other slice along the third dimension (indices 0, 2, 4, ...), dropping the odd-indexed ones:
x[:, :, ::2]
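To try these operations without a data file, here is a small sketch on a 4x4x4 cube. The cube and its values are made up purely for illustration and stand in for the 100x100x100 one.

```python
import numpy as np

# A small 4x4x4 cube with arbitrary values 0/64, 1/64, ..., 63/64.
x = np.arange(64, dtype=float).reshape((4, 4, 4)) / 64.0

s = x.sum(axis=1)              # collapses the second dimension
print(s.shape)

i, j, k = (x > 0.5).nonzero()  # one index array per dimension
print(len(i))                  # number of cells above the threshold

kept = x[:, :, ::2]            # keeps third-dimension indices 0 and 2
print(kept.shape)
```

Note that the step slice `::2` starts at index 0, so it is the even-indexed slices that survive; `1::2` would select the odd-indexed ones instead.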
Also, many useful libraries work with NumPy arrays, for example statistical analysis and visualization libraries.
Even if you don’t have performance problems, learning NumPy is worth the effort.
Alex mentioned memory efficiency and Roberto mentioned convenience, and these are both good points. For a few more ideas, I’ll mention speed and functionality.
Functionality: You get a lot built in with NumPy: FFTs, convolutions, fast searching, basic statistics, linear algebra, histograms, etc. And really, who can live without FFTs?
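A few of these built-ins in action, as a short sketch. The signal, the 2x2 system, and the data values are all invented for illustration.

```python
import numpy as np

# FFT: find the dominant frequency of a synthetic 5 Hz sine sampled at 100 Hz.
t = np.arange(100) / 100.0
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / 100.0)
print(freqs[spectrum.argmax()])

# Linear algebra: solve the 2x2 system A @ v = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
v = np.linalg.solve(A, b)
print(v)

# Histogram and basic statistics on made-up data.
data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0])
counts, edges = np.histogram(data, bins=3)
print(counts, data.mean(), data.std())
```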
Speed: Here’s a test of summing a list and a NumPy array, showing that the sum on the NumPy array is more than 10x faster (in this test; mileage may vary).
from numpy import arange
from timeit import Timer

Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
# In Python 3, range() is lazy, so build a real list for a fair comparison
y = list(range(Nelements))

t_numpy = Timer("x.sum()", "from __main__ import x")
t_list = Timer("sum(y)", "from __main__ import y")
# Report the average time per call, in seconds
print("numpy: %.3e" % (t_numpy.timeit(Ntimeits)/Ntimeits,))
print("list: %.3e" % (t_list.timeit(Ntimeits)/Ntimeits,))
which on my system (while I’m running a backup) gives:
numpy: 3.004e-05
list: 5.363e-04
Here’s a nice answer from the FAQ on the scipy.org website:
What advantages do NumPy arrays offer over (nested) Python lists?
Python’s lists are efficient general-purpose containers. They support
(fairly) efficient insertion, deletion, appending, and concatenation,
and Python’s list comprehensions make them easy to construct and
manipulate. However, they have certain limitations: they don’t support
“vectorized” operations like elementwise addition and multiplication,
and the fact that they can contain objects of differing types means
that Python must store type information for every element, and must
execute type dispatching code when operating on each element. This
also means that very few list operations can be carried out by
efficient C loops – each iteration would require type checks and other
Python API bookkeeping.
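The point about “vectorized” operations can be seen in a few lines. This is an illustrative sketch with made-up values, not part of the FAQ text.

```python
import numpy as np

a_list = [1.0, 2.0, 3.0]
b_list = [10.0, 20.0, 30.0]

# With lists, + concatenates; elementwise work needs an explicit Python loop.
concatenated = a_list + b_list
summed_list = [a + b for a, b in zip(a_list, b_list)]

# With arrays, + and * act elementwise, and the loop runs in compiled code.
a_arr = np.array(a_list)
b_arr = np.array(b_list)
summed_arr = a_arr + b_arr
product_arr = a_arr * b_arr

print(concatenated)
print(summed_list)
print(summed_arr, product_arr)
```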
The other answers have covered almost all the major differences between a NumPy array and a Python list; I will just summarize them here:
NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.
The elements in a NumPy array are all required to be of the same data type (heterogeneous contents are possible, but they will not permit mathematical operations), and thus will be the same size in memory.
NumPy arrays facilitate advanced mathematical and other types of operations on large amounts of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
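The fixed-size point can be illustrated with a short sketch. It uses `np.append` as one example of a size-changing call; the values are arbitrary.

```python
import numpy as np

lst = [1, 2, 3]
lst.append(4)                 # grows the same list object in place

arr = np.array([1, 2, 3])
bigger = np.append(arr, 4)    # returns a new array; arr is unchanged
print(arr.shape, bigger.shape)

# Mixed content is possible with dtype=object, but the array then holds
# references to Python objects and loses the packed, fixed-size layout.
mixed = np.array([1, "two", 3.0], dtype=object)
print(mixed.dtype)
```

Because every append copies the data, growing an array one element at a time in a loop is usually slower than collecting values in a list and converting once, or preallocating the array.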
The standard mutable multielement container in Python is the list. Because of Python’s dynamic typing, we can even create heterogeneous lists. To allow these flexible types, each item in the list must contain its own type info, reference count, and other information. That is, each item is a complete Python object.
In the special case that all variables are of the same type, much of this information is redundant; it can be much more efficient to store data in a fixed-type array (NumPy-style).
Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
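A minimal sketch of the contrast, with arbitrary example values: each list item carries its own type, while an array records a single dtype for all of its elements.

```python
import numpy as np

mixed_list = [1, 2.5, "three"]
# Each item is a full Python object that knows its own type.
print([type(item).__name__ for item in mixed_list])

# One dtype describes every element, so the integers are upcast to float here.
arr = np.array([1, 2.5, 3])
print(arr.dtype, arr.itemsize, arr.nbytes)
```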
- NumPy is not another programming language but a Python extension module. It provides fast and efficient operations on arrays of homogeneous data.
A NumPy array has a fixed size at creation.
- In Python, lists are written with square brackets.
These lists can be homogeneous or heterogeneous.
- The main advantages of using NumPy arrays over Python lists:
  - They consume less memory.
  - They are faster than Python lists.
  - They are convenient to use.