### Question:

Is there a way to convert from a `pandas.SparseDataFrame` to a `scipy.sparse.csr_matrix` without generating a dense matrix in memory?

```
scipy.sparse.csr_matrix(df.values)
```

doesn’t work, as it generates a dense matrix which is then cast to the `csr_matrix`.

Thanks in advance!

## Answer #1:

The pandas docs describe an experimental conversion to scipy sparse, `SparseSeries.to_coo`:

http://pandas-docs.github.io/pandas-docs-travis/sparse.html#interaction-with-scipy-sparse

================

Edit: `SparseSeries.to_coo` is a special function that works from a MultiIndex, not from a data frame. See the other answers for the data frame case. Note the difference in dates.

============

As of 0.20.0, there is a `sdf.to_coo()` for data frames and a MultiIndex-based `ss.to_coo()` for series. Since a sparse matrix is inherently 2D, it makes sense to require a MultiIndex for the (effectively) 1D series, whereas the data frame can already represent a table or 2D array.

When I first responded to this question (June 2015), this sparse data frame/series feature was experimental.
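To illustrate the MultiIndex requirement, here is a minimal sketch using the `sparse` accessor that newer pandas releases provide on a `Series` with a sparse dtype (an assumption on my part: it targets recent pandas, not the `SparseSeries` class discussed above). The level names `"row"` and `"col"` and the sample values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each entry is addressed by a (row, col) pair in a MultiIndex.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 2), (1, 1), (2, 0)], names=["row", "col"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Give the Series a sparse dtype so that the .sparse accessor is available.
ss = s.astype(pd.SparseDtype("float64", fill_value=np.nan))

# One index level supplies the matrix rows, the other the matrix columns.
coo, row_labels, col_labels = ss.sparse.to_coo(
    row_levels=["row"], column_levels=["col"], sort_labels=True
)
csr = coo.tocsr()
```

The two label lists returned next to the matrix tell you which index values ended up in which row and column position.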

## Answer #2:

# Pandas 0.20.0+:

As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:

```
from scipy import sparse


def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())
```

This uses the new `to_coo()` method.

# Earlier Versions:

Building on Victor May’s answer, here’s a slightly faster implementation, but it only works if every column of the `SparseDataFrame` is sparse and uses a `BlockIndex` (note: if it was created with `get_dummies`, this will be the case).

**Edit**: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.

```
import numpy as np
import pandas as pd
from scipy import sparse


def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    # Per column: stored values shifted by the fill value, and their row positions.
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    # Column position repeated once per stored value in that column.
    cols = [np.ones_like(a) * i for (i, a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()
```
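To make the "record it externally" point concrete, here is a small sketch of the idea using only numpy and scipy. The `fill_value` of 1.0 and the sample array are hypothetical; the dense array exists only to keep the demonstration short and checkable, which you would of course avoid with real data.

```python
import numpy as np
from scipy import sparse

fill_value = 1.0  # hypothetical non-zero fill value, kept outside the matrix
dense = np.array([[1.0, 5.0, 1.0],
                  [1.0, 1.0, 7.0]])

# Store only the entries that differ from the fill value, as offsets from it.
rows, cols = np.nonzero(dense != fill_value)
offsets = dense[rows, cols] - fill_value
csr = sparse.coo_matrix((offsets, (rows, cols)), shape=dense.shape).tocsr()

# Adding the externally recorded fill value back recovers the original values.
restored = csr.toarray() + fill_value
```

Only two values are stored in `csr` here; everything else is implied by `fill_value`, which the CSR matrix itself knows nothing about.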

## Answer #3:

The answer by @Marigold does the trick, but it is slow due to accessing all elements in each column, including the zeros. Building on it, I wrote the following quick n’ dirty code, which runs about 50x faster on a 1000×1000 matrix with a density of about 1%. My code also handles dense columns appropriately.

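```
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
# BlockIndex lives in a private module whose path depends on the pandas
# version (pandas._libs.sparse from 0.20, pandas._sparse before that).
from pandas._libs.sparse import BlockIndex


def sparse_df_to_array(df):
    num_rows = df.shape[0]
    data = []
    row = []
    col = []
    for i, col_name in enumerate(df.columns):
        if isinstance(df[col_name], pd.SparseSeries):
            # Sparse column: take only the stored values and their row positions.
            column_index = df[col_name].sp_index
            if isinstance(column_index, BlockIndex):
                column_index = column_index.to_int_index()
            ix = column_index.indices
            data.append(df[col_name].sp_values)
            row.append(ix)
            col.append(len(df[col_name].sp_values) * [i])
        else:
            # Dense column: every row is stored.
            data.append(df[col_name].values)
            row.append(np.array(range(0, num_rows)))
            col.append(np.array(num_rows * [i]))
    data_f = np.concatenate(data)
    row_f = np.concatenate(row)
    col_f = np.concatenate(col)
    arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
    return arr.tocsr()
```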

## Answer #4:

As of pandas version 0.25, `SparseSeries` and `SparseDataFrame` are deprecated. DataFrames now support sparse dtypes for columns with sparse data. Sparse methods are available through the `sparse` accessor, so the conversion one-liner now looks like this:

```
import scipy.sparse
sparse_matrix = scipy.sparse.csr_matrix(df.sparse.to_coo())
```
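For anyone who wants to try this end to end, here is a minimal sketch assuming a recent pandas with sparse dtypes. The column names and values are made up, and the small dense frame is only there to build the input and check the result.

```python
import numpy as np
import pandas as pd

# Hypothetical, mostly-zero data used only to build a sparse frame.
dense = pd.DataFrame({
    "a": [0.0, 1.0, 0.0, 0.0],
    "b": [0.0, 0.0, 2.0, 0.0],
    "c": [3.0, 0.0, 0.0, 0.0],
})

# Convert every column to a sparse dtype with 0 as the fill value.
sdf = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# to_coo() builds a scipy COO matrix from the stored values; tocsr() converts it.
csr = sdf.sparse.to_coo().tocsr()
```

Note that `df.sparse.to_coo()` expects all columns to have a sparse dtype, which is why the whole frame is converted with `astype` first.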

## Answer #5:

Here’s a solution that fills the sparse matrix column by column (it assumes you can fit at least one column into memory).

```
import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix


def sparse_df_to_array(df):
    """Convert sparse dataframe to sparse array csr_matrix used by
    scikit learn."""
    arr = lil_matrix(df.shape, dtype=np.float32)
    for i, col in enumerate(df.columns):
        # Boolean mask of the non-zero entries in this column.
        ix = df[col] != 0
        # Note: .ix is deprecated since pandas 0.20.
        arr[np.where(ix), i] = df.ix[ix, col]
    return arr.tocsr()
```

## Answer #6:

**EDIT**: This method actually creates a dense representation at some stage, so it doesn’t solve the problem in the question.

You should be able to use the experimental `.to_coo()` method in pandas [1] in the following way:

```
df, idx_rows, idx_cols = df.stack().to_sparse().to_coo()
df = df.tocsr()
```

Instead of taking a `DataFrame` (rows/columns), this method takes a `Series` with rows and columns in a `MultiIndex` (this is why you need the `.stack()` method). This `Series` with the `MultiIndex` needs to be a `SparseSeries`, and even if your input is a `SparseDataFrame`, `.stack()` returns a regular `Series`. So, you need to use the `.to_sparse()` method before calling `.to_coo()`.

The `Series` returned by `.stack()`, even if it’s not a `SparseSeries`, only contains the elements that are not null, so it shouldn’t take more memory than the sparse version (at least with `np.nan` when the type is `np.float`).