# Convert pandas dataframe to NumPy array


Problem solving gets easier the more situations you expose yourself to, such as converting a pandas DataFrame to a NumPy array, and the more you practice the same strategies. With time, this becomes second nature. Whether a problem is big or small, start with a plan and work through the strategies mentioned here until you are confident and ready to code the solution.
This post gives an overview of converting a pandas DataFrame to a NumPy array that you can come back to at any time.


I am interested in knowing how to convert a pandas dataframe into a NumPy array.

dataframe:

``````import numpy as np
import pandas as pd
index = [1, 2, 3, 4, 5, 6, 7]
a = [np.nan, np.nan, np.nan, 0.1, 0.1, 0.1, 0.1]
b = [0.2, np.nan, 0.2, 0.2, 0.2, np.nan, np.nan]
c = [np.nan, 0.5, 0.5, np.nan, 0.5, 0.5, np.nan]
df = pd.DataFrame({'A': a, 'B': b, 'C': c}, index=index)
df = df.rename_axis('ID')
``````

gives

``````label   A    B    C
ID
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN
``````

I would like to convert this to a NumPy array, like so:

``````array([[ nan,  0.2,  nan],
[ nan,  nan,  0.5],
[ nan,  0.2,  0.5],
[ 0.1,  0.2,  nan],
[ 0.1,  0.2,  0.5],
[ 0.1,  nan,  0.5],
[ 0.1,  nan,  nan]])
``````

How can I do this?

As a bonus, is it possible to preserve the dtypes, like this?

``````array([[ 1, nan,  0.2,  nan],
[ 2, nan,  nan,  0.5],
[ 3, nan,  0.2,  0.5],
[ 4, 0.1,  0.2,  nan],
[ 5, 0.1,  0.2,  0.5],
[ 6, 0.1,  nan,  0.5],
[ 7, 0.1,  nan,  nan]],
dtype=[('ID', '<i4'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
``````

or similar?

To convert a pandas dataframe (df) to a numpy ndarray, use this code:

``````df.values
array([[nan, 0.2, nan],
[nan, nan, 0.5],
[nan, 0.2, 0.5],
[0.1, 0.2, nan],
[0.1, 0.2, 0.5],
[0.1, nan, 0.5],
[0.1, nan, nan]])
``````

# `df.to_numpy()` is better than `df.values`, here’s why.*

It’s time to deprecate your usage of `values` and `as_matrix()`.

pandas `v0.24.0` introduced two new methods for obtaining NumPy arrays from pandas objects:

1. `to_numpy()`, which is defined on `Index`, `Series`, and `DataFrame` objects, and
2. `array`, which is defined on `Index` and `Series` objects only.
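
As a rough illustration of the difference between the two accessors, the sketch below compares them on a small Series. The variable names are made up for this example, and the exact extension-array class behind `.array` varies by pandas version, so only its base class is checked.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], name="A")

as_ndarray = s.to_numpy()  # a plain numpy.ndarray
as_ext = s.array           # a pandas ExtensionArray wrapping the data

print(isinstance(as_ndarray, np.ndarray))
print(isinstance(as_ext, pd.api.extensions.ExtensionArray))

# `array` is not defined on DataFrame; `to_numpy()` is.
print(hasattr(pd.DataFrame, "array"), hasattr(pd.DataFrame, "to_numpy"))
```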

If you visit the v0.24 docs for `.values`, you will see a big red warning that says:

### Warning: We recommend using `DataFrame.to_numpy()` instead.

* – `to_numpy()` is my recommended method for any production code that needs to run reliably for many versions into the future. However, if you’re just making a scratchpad in Jupyter or the terminal, using `.values` to save a few milliseconds of typing is a permissible exception. You can always add the fit and finish later.

# Towards Better Consistency: `to_numpy()`

In the spirit of better consistency throughout the API, a new method `to_numpy` has been introduced to extract the underlying NumPy array from DataFrames.

``````# Setup
df = pd.DataFrame(data={'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]},
index=['a', 'b', 'c'])
# Convert the entire DataFrame
df.to_numpy()
# array([[1, 4, 7],
#        [2, 5, 8],
#        [3, 6, 9]])
# Convert specific columns
df[['A', 'C']].to_numpy()
# array([[1, 7],
#        [2, 8],
#        [3, 9]])
``````

As mentioned above, this method is also defined on `Index` and `Series` objects.

``````df.index.to_numpy()
# array(['a', 'b', 'c'], dtype=object)
df['A'].to_numpy()
#  array([1, 2, 3])
``````

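When all columns share a single NumPy dtype, `to_numpy()` can return a view rather than a copy, so modifications made to the array may affect the original DataFrame. (With Copy-on-Write enabled, which is the default from pandas 3.0, the returned array is read-only instead.)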

``````v = df.to_numpy()
v[0, 0] = -1
df
#    A  B  C
# a -1  4  7
# b  2  5  8
# c  3  6  9
``````

If you need a copy instead, use `to_numpy(copy=True)`.
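
A minimal sketch of both situations, using hypothetical frames `df_int` and `df_mixed`: `copy=True` isolates the array from the DataFrame, and a frame with mixed column dtypes is converted to a freshly built array with a common dtype.

```python
import numpy as np
import pandas as pd

df_int = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# copy=True guarantees an independent array, so edits never reach df_int.
v = df_int.to_numpy(copy=True)
v[0, 0] = -1
print(df_int.loc[0, "A"])  # still 1

# With mixed column dtypes there is no single shared buffer: pandas builds
# a new array using a common dtype (float64 for int + float columns).
df_mixed = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
print(df_mixed.to_numpy().dtype)  # float64
```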

### pandas >= 1.0 update for ExtensionTypes

If you’re using pandas 1.x, chances are you’ll be dealing with extension types a lot more. You’ll have to be a little more careful that these extension types are correctly converted.

``````a = pd.array([1, 2, None], dtype="Int64")
a
# <IntegerArray>
# [1, 2, <NA>]
# Length: 3, dtype: Int64
# Wrong
a.to_numpy()
# array([1, 2, <NA>], dtype=object)  # yuck, objects
# Correct
a.to_numpy(dtype='float', na_value=np.nan)
# array([ 1.,  2., nan])
# Also correct
a.to_numpy(dtype='int', na_value=-1)
# array([ 1,  2, -1])
``````

This is called out in the docs.
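
The same idea applies to a nullable column taken from a DataFrame. The sketch below assumes a hypothetical frame `df_nullable` with one `Int64` column and converts that column through `Series.to_numpy`, naming both the target dtype and the replacement for `<NA>`.

```python
import numpy as np
import pandas as pd

df_nullable = pd.DataFrame({"x": pd.array([1, 2, None], dtype="Int64")})

# Ask for a concrete NumPy dtype and say what <NA> should become.
x = df_nullable["x"].to_numpy(dtype="float64", na_value=np.nan)
print(x.dtype)         # float64
print(np.isnan(x[2]))  # True
```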

### If you need the `dtypes` in the result…

As shown in another answer, `DataFrame.to_records` is a good way to do this.

``````df.to_records()
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', 'O'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
``````

This cannot be done with `to_numpy`, unfortunately. However, as an alternative, you can use `np.rec.fromrecords`:

``````v = df.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# rec.array([('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)],
#           dtype=[('index', '<U1'), ('A', '<i8'), ('B', '<i8'), ('C', '<i8')])
``````

Performance-wise, it’s nearly the same (actually, using `rec.fromrecords` is a bit faster).

``````df2 = pd.concat([df] * 10000)

%timeit df2.to_records()
# 12.9 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
v = df2.reset_index()
np.rec.fromrecords(v, names=v.columns.tolist())
# 9.56 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
``````
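
Coming back to the question's bonus request (an `ID` field stored as a 4-byte integer), `to_records` accepts `index_dtypes` and `column_dtypes` arguments (pandas 0.24 and later). The following is only a sketch on a shortened, hypothetical version of the question's frame, `df_q`.

```python
import numpy as np
import pandas as pd

df_q = pd.DataFrame(
    {"A": [np.nan, 0.1], "B": [0.2, np.nan], "C": [np.nan, 0.5]},
    index=pd.Index([1, 2], name="ID"),
)

# index_dtypes overrides the dtype used for the index field(s);
# the field name comes from the index name.
rec = df_q.to_records(index_dtypes="<i4")
print(rec.dtype.names)  # ('ID', 'A', 'B', 'C')
print(rec.dtype["ID"])  # int32
```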

# Rationale for Adding a New Method

`to_numpy()` (in addition to `array`) was added as a result of discussions under two GitHub issues GH19954 and GH23623.

Specifically, the docs mention the rationale:

[…] with `.values` it was unclear whether the returned value would be the
actual array, some transformation of it, or one of pandas custom
arrays (like `Categorical`). For example, with `PeriodIndex`, `.values`
generates a new `ndarray` of period objects each time. […]

`to_numpy` aims to improve the consistency of the API, which is a major step in the right direction. `.values` will not be deprecated in the current version, but I expect this may happen at some point in the future, so I would urge users to migrate towards the newer API, as soon as you can.
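
To make the ambiguity of `.values` concrete, here is a small sketch with a categorical Series (the name `s_cat` is made up for the example): `.values` hands back a pandas `Categorical`, while `to_numpy()` returns an actual NumPy array.

```python
import numpy as np
import pandas as pd

s_cat = pd.Series(["a", "b", "a"], dtype="category")

print(type(s_cat.values).__name__)      # Categorical
print(type(s_cat.to_numpy()).__name__)  # ndarray
```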

# Critique of Other Solutions

`DataFrame.values` has inconsistent behaviour, as already noted.

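`DataFrame.get_values()` is simply a wrapper around `DataFrame.values`, so everything said above applies. It was deprecated in pandas 0.25 and removed in pandas 1.0.

`DataFrame.as_matrix()` was deprecated in pandas 0.23 and removed in pandas 1.0, so do NOT use it!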

Note: The `.as_matrix()` method used in this answer is deprecated and was removed in pandas 1.0; on current versions, use `df.to_numpy()` instead. Pandas 0.23.4 warns:

Method `.as_matrix` will be removed in a future version. Use .values instead.

Pandas has something built in…

``````numpy_matrix = df.as_matrix()
``````

gives

``````array([[nan, 0.2, nan],
[nan, nan, 0.5],
[nan, 0.2, 0.5],
[0.1, 0.2, nan],
[0.1, 0.2, 0.5],
[0.1, nan, 0.5],
[0.1, nan, nan]])
``````

I would just chain `DataFrame.reset_index()` and `DataFrame.values` to get the NumPy representation of the dataframe, including the index:

``````In [8]: df
Out[8]:
A         B         C
0 -0.982726  0.150726  0.691625
1  0.617297 -0.471879  0.505547
2  0.417123 -1.356803 -1.013499
3 -0.166363 -0.957758  1.178659
4 -0.164103  0.074516 -0.674325
5 -0.340169 -0.293698  1.231791
6 -1.062825  0.556273  1.508058
7  0.959610  0.247539  0.091333
[8 rows x 3 columns]
In [9]: df.reset_index().values
Out[9]:
array([[ 0.        , -0.98272574,  0.150726  ,  0.69162512],
[ 1.        ,  0.61729734, -0.47187926,  0.50554728],
[ 2.        ,  0.4171228 , -1.35680324, -1.01349922],
[ 3.        , -0.16636303, -0.95775849,  1.17865945],
[ 4.        , -0.16410334,  0.0745164 , -0.67432474],
[ 5.        , -0.34016865, -0.29369841,  1.23179064],
[ 6.        , -1.06282542,  0.55627285,  1.50805754],
[ 7.        ,  0.95961001,  0.24753911,  0.09133339]])
``````

To get named fields, we can turn this ndarray into a structured array using `view`. Note that `view` only reinterprets the existing bytes; because `.values` has already upcast the index to `float64`, every field (including `index`) has to be declared as `float`. Declaring it as `int` would reinterpret the float bits rather than convert them.

``````In [10]: df.reset_index().values.ravel().view(dtype=[('index', float), ('A', float), ('B', float), ('C', float)])
Out[10]:
array([(0., -0.98272574,  0.150726  ,  0.69162512),
(1.,  0.61729734, -0.47187926,  0.50554728),
(2.,  0.4171228 , -1.35680324, -1.01349922),
(3., -0.16636303, -0.95775849,  1.17865945),
(4., -0.16410334,  0.0745164 , -0.67432474),
(5., -0.34016865, -0.29369841,  1.23179064),
(6., -1.06282542,  0.55627285,  1.50805754),
(7.,  0.95961001,  0.24753911,  0.09133339)],
dtype=[('index', '<f8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
``````
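
One caveat with `view` is that it reinterprets bytes rather than converting values, so an index that has been upcast to float cannot simply be viewed as an integer. The sketch below shows that effect and one possible alternative, `reset_index().to_records(index=False)`, which keeps each column's own dtype. The frame `df_f` is a made-up example, not the one from the answer.

```python
import numpy as np
import pandas as pd

# view() reinterprets the bytes of 1.0 as an integer; it does not convert.
print(np.array([1.0]).view(np.int64)[0])  # 4607182418800017408, not 1

df_f = pd.DataFrame({"A": [0.5, 1.5], "B": [2.5, 3.5]})

# reset_index() adds the integer index as a column; to_records(index=False)
# then stores every column with its own dtype.
rec = df_f.reset_index().to_records(index=False)
print(rec.dtype.names)          # ('index', 'A', 'B')
print(rec.dtype["index"].kind)  # i
```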

You can use the `to_records` method, but have to play around a bit with the dtypes if they are not what you want from the get go. In my case, having copied your DF from a string, the index type is string (represented by an `object` dtype in pandas):

``````In [102]: df
Out[102]:
label    A    B    C
ID
1      NaN  0.2  NaN
2      NaN  NaN  0.5
3      NaN  0.2  0.5
4      0.1  0.2  NaN
5      0.1  0.2  0.5
6      0.1  NaN  0.5
7      0.1  NaN  NaN
In [103]: df.index.dtype
Out[103]: dtype('object')
In [104]: df.to_records()
Out[104]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
(4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
(7, 0.1, nan, nan)],
dtype=[('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [106]: df.to_records().dtype
Out[106]: dtype([('index', '|O8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
``````

Converting the recarray dtype does not work for me, but one can do this in Pandas already:

``````In [109]: df.index = df.index.astype('i8')
In [111]: df.to_records().view([('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
Out[111]:
rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
(4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
(7, 0.1, nan, nan)],
dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
``````

Note that Pandas does not set the name of the index properly (to `ID`) in the exported record array (a bug?), so we profit from the type conversion to also correct for that.

At the time this answer was written, Pandas had only 8-byte integers, `i8`, and floats, `f8` (see this issue).

It seems like `df.to_records()` will work for you. The exact feature you’re looking for was requested, and `to_records` was pointed to as an alternative.

I tried this out locally using your example, and that call yields something very similar to the output you were looking for:

``````rec.array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
(4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
(7, 0.1, nan, nan)],
dtype=[(u'ID', '<i8'), (u'A', '<f8'), (u'B', '<f8'), (u'C', '<f8')])
``````

Note that this is a `recarray` rather than an `array`. You could convert the result into a regular NumPy array by calling its constructor: `np.array(df.to_records())`.
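
A short sketch of that conversion, on a hypothetical two-row frame `df_small`: the record array and the plain array carry the same named fields, but only the former is a `recarray`.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"A": [0.1, 0.2]}, index=pd.Index([1, 2], name="ID"))

rec = df_small.to_records()  # numpy.recarray: fields also reachable as rec.A
arr = np.array(rec)          # plain ndarray with a structured dtype

print(type(rec).__name__, type(arr).__name__)  # recarray ndarray
print(arr["ID"].tolist())                      # [1, 2]
```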

Try this:

``````import numpy as np
a = np.asarray(df)
``````
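
`np.asarray` works here because a DataFrame implements NumPy's `__array__` protocol; the index and column labels are dropped. As a quick check on a made-up frame `df_ints`, the result matches `to_numpy()`:

```python
import numpy as np
import pandas as pd

df_ints = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

a = np.asarray(df_ints)
print(np.array_equal(a, df_ints.to_numpy()))  # True
print(a.shape)                                # (2, 2)
```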

Here is my approach to making a structured array from a pandas DataFrame.

Create the data frame:

``````import pandas as pd
import numpy as np
import six
NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)
#       A    B    C
# ID
# 1   NaN  0.2  NaN
# 2   NaN  NaN  0.5
# 3   NaN  0.2  0.5
# 4   0.1  0.2  NaN
# 5   0.1  0.2  0.5
# 6   0.1  NaN  0.5
# 7   0.1  NaN  NaN
``````

Define a function to make a NumPy structured array (not a record array) from a pandas DataFrame.

``````def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    This is functionally equivalent to but more efficient than
    np.array(df.to_array())

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    if six.PY2:  # python 2 needs .encode() but 3 does not
        types = [(cols[i].encode(), df[k].dtype.type) for (i, k) in enumerate(cols)]
    else:
        types = [(cols[i], df[k].dtype.type) for (i, k) in enumerate(cols)]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z
``````

Use `reset_index` to make a new data frame that includes the index as part of its data. Convert that data frame to a structured array.

``````sa = df_to_sarray(df.reset_index())
sa
array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
(4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
(7L, 0.1, nan, nan)],
dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
``````

EDIT: Updated df_to_sarray to avoid error calling .encode() with python 3. Thanks to Joseph Garvin and halcyon for their comment and solution.
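
For code that only targets Python 3, the `six` branch is no longer needed. The following is a minimal sketch of the same idea, not the answer author's code: `df_to_structured` is a hypothetical name, and it assumes unique column names and NumPy-backed column dtypes (no extension types).

```python
import numpy as np
import pandas as pd


def df_to_structured(df):
    """Build a structured ndarray with one field per DataFrame column."""
    # Field names must be strings; each field keeps its column's dtype.
    dtype = np.dtype([(str(col), df[col].dtype) for col in df.columns])
    out = np.empty(len(df), dtype=dtype)
    for col in df.columns:
        out[str(col)] = df[col].to_numpy()
    return out


df_demo = pd.DataFrame(
    {"A": [np.nan, 0.1], "B": [0.2, np.nan]},
    index=pd.Index([1, 2], name="ID"),
)
sa = df_to_structured(df_demo.reset_index())
print(sa.dtype.names)     # ('ID', 'A', 'B')
print(sa["ID"].tolist())  # [1, 2]
```

Copying column by column avoids first upcasting everything to one common dtype, which is what `df.values` would do for a frame that mixes integers and floats.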