How do I create test and train samples from one dataframe with pandas?

Posted on

Question :

How do I create test and train samples from one dataframe with pandas?

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.

Thanks!

Asked By: tooty44

||

Answer #1:

I would just use numpy’s randn:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
Answered By: Andy Hayden

Answer #2:

scikit learn’s train_test_split is a good one – it will split both numpy arrays as dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
Answered By: gobrewers14

Answer #3:

Pandas random sample will also work

train=df.sample(frac=0.8,random_state=200) #random state is a seed value
test=df.drop(train.index)
Answered By: PagMax

Answer #4:

I would use scikit-learn’s own training_test_split, and generate it from the index

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
Answered By: Napitupulu Jon

Answer #5:

There are many ways to create a train/test and even validation samples.

Case 1: classic way train_test_split without any options:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

Case 2: case of a very small datasets (<500 rows): in order to get results for all your lines with this cross-validation. At the end, you will have one prediction for each line of your available training set.

from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:

from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

Case 4: you need to create a train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).

from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)
Answered By: yannick_leo

Answer #6:

No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

And if you want to split x from y

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

And if you want to split the whole df

X, y = df[list_of_x_cols], df[y_col]
Answered By: Nosey

Answer #7:

You can use below code to create test and train samples :

from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)

Test size can vary depending on the percentage of data you want to put in your test and train dataset.

Answered By: user1775015

Answer #8:

There are many valid answers. Adding one more to the bunch.
from sklearn.cross_validation import train_test_split

#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]
Answered By: Abhi

Leave a Reply

Your email address will not be published. Required fields are marked *