I have the following DataFrame:
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
...
20     7     8     9     2
21    10    11    12     2
...
45    13    14    15     3
46    16    17    18     3
...
The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.
I would like to shuffle the order of the DataFrame's rows, so that all Types are mixed. A possible result could be:
    Col1  Col2  Col3  Type
0      7     8     9     2
1     13    14    15     3
...
20     1     2     3     1
21    10    11    12     2
...
45     4     5     6     1
46    16    17    18     3
...
How can I achieve this?
The idiomatic way to do this with Pandas is to use the .sample method of your dataframe to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means return all rows (in random order).
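As a minimal sketch of what frac=1 does, the snippet below shuffles a small made-up frame (the data and the names df and shuffled are assumptions for illustration, not the asker's file) and checks that every row is still present exactly once:

```python
import pandas as pd

# Small illustrative frame; the values are made up.
df = pd.DataFrame({
    "Col1": [1, 4, 7, 10, 13, 16],
    "Type": [1, 1, 2, 2, 3, 3],
})

# frac=1 samples 100% of the rows without replacement,
# so the result is a random permutation of the rows.
shuffled = df.sample(frac=1)

# Each original index label appears exactly once in the result.
print(sorted(shuffled.index) == list(df.index))
print(shuffled)
```

The original index labels travel with their rows, which is why the printed index is no longer in ascending order.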
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
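To see the effect of drop=True, this small sketch compares the two forms of reset_index on a made-up frame (the data and variable names are illustrative assumptions):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"Col1": [1, 4, 7], "Type": [1, 2, 3]})
shuffled = df.sample(frac=1, random_state=0)

# Default: the old index is kept as a new column named "index".
kept = shuffled.reset_index()

# drop=True: the old index is discarded and replaced by 0..n-1.
dropped = shuffled.reset_index(drop=True)

print(list(kept.columns))     # ['index', 'Col1', 'Type']
print(list(dropped.columns))  # ['Col1', 'Type']
```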
Follow-up note: Although it may not look like the above operation is in-place, Python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:
$ python3 -m memory_profiler .test.py
Filename: .test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
You can simply use sklearn for this:
from sklearn.utils import shuffle
df = shuffle(df)
You can shuffle the rows of a dataframe by indexing with a shuffled index. For this, you can e.g. use np.random.permutation (but np.random.choice is also a possibility):
In : df = pd.read_csv(StringIO(s), sep="\s+")

In : df
Out:
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
20     7     8     9     2
21    10    11    12     2
45    13    14    15     3
46    16    17    18     3

In : df.iloc[np.random.permutation(len(df))]
Out:
    Col1  Col2  Col3  Type
46    16    17    18     3
45    13    14    15     3
20     7     8     9     2
0      1     2     3     1
1      4     5     6     1
21    10    11    12     2
If you want to keep the index numbered from 1, 2, .., n as in your example, you can simply reset the index:
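A sketch of that reset step, assuming the shuffled result is stored in a variable called df_shuffled (a name chosen here for illustration) and using made-up data:

```python
import numpy as np
import pandas as pd

# Illustrative frame with a non-contiguous index, as in the question.
df = pd.DataFrame(
    {"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]},
    index=[0, 1, 20, 21],
)

# Shuffle by positional indexing with a random permutation of 0..n-1.
df_shuffled = df.iloc[np.random.permutation(len(df))]

# Replace the shuffled labels with a fresh 0..n-1 index.
df_shuffled = df_shuffled.reset_index(drop=True)

print(df_shuffled)
```

Note that reset_index numbers the rows from 0; if a 1-based index is wanted, it would have to be assigned separately.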
np.random.shuffle(ndarray) can do the job.
So, in your case:
np.random.shuffle(df.values)
A DataFrame, under the hood, uses a NumPy ndarray as its data holder (you can check this in the DataFrame source code). So if you use np.random.shuffle(), it shuffles the array along the first axis of a multi-dimensional array, but the index of the DataFrame remains unshuffled.
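The in-place behaviour can be sketched on a plain ndarray (the array below is made up). A plain array is used on purpose: whether df.values is a writable view of the frame's data depends on the dtypes and the pandas version, so shuffling through it may not always affect the DataFrame.

```python
import numpy as np

# Illustrative 4x3 array; each row is [3k, 3k+1, 3k+2].
nd = np.arange(12).reshape(4, 3)

# Keep a copy first, because the shuffle happens in place.
original = nd.copy()

# Rows (the first axis) are reordered; the contents of each row stay intact.
result = np.random.shuffle(nd)

print(result)  # None
print(nd)
```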
There are, however, some points to consider:
- The function returns None. In case you want to keep a copy of the original object, you have to make it before you pass the object to the function.
- sklearn.utils.shuffle(), as user tj89 suggested, can take a random_state along with another option to control the output. You may want that for development purposes.
- sklearn.utils.shuffle() is faster, but it WILL SHUFFLE the axis info (index, column) of the DataFrame as well.
nd = sklearn.utils.shuffle(nd)
0.10793248389381915 sec. 8x faster
df = sklearn.utils.shuffle(df)
0.3183923360193148 sec. 3x faster
import timeit
setup = '''
import numpy as np
import pandas as pd
import sklearn
nd = np.random.random((1000, 100))
df = pd.DataFrame(nd)
'''
timeit.timeit('nd = sklearn.utils.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(nd)', setup=setup, number=1000)
timeit.timeit('df = sklearn.utils.shuffle(df)', setup=setup, number=1000)
timeit.timeit('np.random.shuffle(df.values)', setup=setup, number=1000)
(I don't have enough reputation to comment on the top post, so I hope someone else can do that for me.) A concern was raised about whether the first method, df.sample(frac=1), makes a deep copy or just changes the dataframe in place. I ran the following code:
print(hex(id(df)))
print(hex(id(df.sample(frac=1))))
print(hex(id(df.sample(frac=1).reset_index(drop=True))))
and my results were:
0x1f8a784d400
0x1f8b9d65e10
0x1f8b9d65b70
which means the method is not returning the same object, as was suggested in the last comment. So this method does indeed make a shuffled copy.
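The same point can be checked without comparing raw addresses. The sketch below (made-up data, illustrative names) tests object identity directly and confirms that the original frame is left untouched:

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"Col1": [1, 4, 7, 10], "Type": [1, 1, 2, 2]})
before = df.copy()

shuffled = df.sample(frac=1).reset_index(drop=True)

# The result is a different Python object ...
print(shuffled is df)     # False

# ... and the original frame has not been modified.
print(df.equals(before))  # True
```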
Also useful: if you use this for machine learning and always want to separate the same data, you could fix the random seed (for example with the random_state argument of .sample).
This makes sure that your random choice is always reproducible.
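A minimal sketch of such a reproducible shuffle, assuming the seed is passed through the random_state argument of .sample (the value 42 and the data are arbitrary choices for illustration):

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"Col1": range(10), "Type": [1, 2] * 5})

# The same seed gives the same row order on every call.
first = df.sample(frac=1, random_state=42)
second = df.sample(frac=1, random_state=42)

print(first.equals(second))  # True
```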
AFAIK the simplest solution is:
df_shuffled = df.reindex(np.random.permutation(df.index))
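A short sketch of this reindex approach on made-up data. One assumption to flag: reindex looks rows up by label, so this relies on the index labels being unique.

```python
import numpy as np
import pandas as pd

# Illustrative frame with unique, non-contiguous index labels.
df = pd.DataFrame({"Col1": [1, 4, 7, 10]}, index=[0, 1, 20, 21])

# Permute the labels, then fetch the rows in that label order.
df_shuffled = df.reindex(np.random.permutation(df.index))

# Each row keeps its original label.
print(df_shuffled)
```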
The following could be one way:
dataframe = dataframe.sample(frac=1, random_state=42).reset_index(drop=True)
frac=1 means all rows of the dataframe are returned
random_state=42 means the same order is produced in each execution
reset_index(drop=True) means the index is reinitialized for the randomized dataframe