Shuffle DataFrame rows

Posted on

Question :

Shuffle DataFrame rows

I have the following DataFrame:

    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

The DataFrame is read from a csv file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.

I would like to shuffle the order of the DataFrame’s rows, so that all Type‘s are mixed. A possible result could be:

    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
20     1     2     3     1
21    10    11    12     2
45     4     5     6     1
46    16    17    18     3

How can I achieve this?

Asked By: JNevens


Answer #1:

The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:


The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).

If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.

Follow-up note: Although it may not look like the above operation is in-place, python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:

$ python3 -m memory_profiler

Line #    Mem usage    Increment   Line Contents
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

Answered By: Kris

Answer #2:

You can simply use sklearn for this

from sklearn.utils import shuffle
df = shuffle(df)
Answered By: tj89

Answer #3:

You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can eg use np.random.permutation (but np.random.choice is also a possibility):

In [12]: df = pd.read_csv(StringIO(s), sep="s+")

In [13]: df
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In [14]: df.iloc[np.random.permutation(len(df))]
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2

If you want to keep the index numbered from 1, 2, .., n as in your example, you can simply reset the index: df_shuffled.reset_index(drop=True)

Answered By: joris

Answer #4:

TL;DR: np.random.shuffle(ndarray) can do the job.
So, in your case


DataFrame, under the hood, uses NumPy ndarray as data holder. (You can check from DataFrame source code)

So if you use np.random.shuffle(), it would shuffles the array along the first axis of a multi-dimensional array. But index of the DataFrame remains unshuffled.

Though, there are some points to consider.

  • function returns none. In case you want to keep a copy of the original object, you have to do so before you pass to the function.
  • sklearn.utils.shuffle(), as user tj89 suggested, can designate random_state along with another option to control output. You may want that for dev purpose.
  • sklearn.utils.shuffle() is faster. But WILL SHUFFLE the axis info(index, column) of the DataFrame along with the ndarray it contains.

Benchmark result

between sklearn.utils.shuffle() and np.random.shuffle().


nd = sklearn.utils.shuffle(nd)

0.10793248389381915 sec. 8x faster


0.8897626010002568 sec


df = sklearn.utils.shuffle(df)

0.3183923360193148 sec. 3x faster


0.9357550159329548 sec

Conclusion: If it is okay to axis info(index, column) to be shuffled along with ndarray, use sklearn.utils.shuffle(). Otherwise, use np.random.shuffle()

used code

import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)

timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)

Answered By: haku

Answer #5:

(I don’t have enough reputation to comment this on the top post, so I hope someone else can do that for me.) There was a concern raised that the first method:


made a deep copy or just changed the dataframe. I ran the following code:


and my results were:


which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.

Answered By: NotANumber

Answer #6:

What is also useful, if you use it for Machine_learning and want to seperate always the same data, you could use:

df.sample(n=len(df), random_state=42)

this makes sure, that you keep your random choice always replicatable

Answered By: PV8

Answer #7:

AFAIK the simplest solution is:

df_shuffled = df.reindex(np.random.permutation(df.index))
Answered By: Ido Cohn

Answer #8:

Following could be one of ways:

dataframe = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)


frac=1 means all rows of a dataframe

random_state=42 means keeping same order in each execution

reset_index(drop=True) means reinitialize index for randomized dataframe

Answered By: Anshul Singhal

Leave a Reply

Your email address will not be published. Required fields are marked *