Question :
I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
Answer #1:
I would just use numpy’s randn
:
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to see this has worked:
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
Answer #2:
scikit learn’s train_test_split
is a good one – it will split both numpy arrays as dataframes.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
Answer #3:
Pandas random sample will also work
train=df.sample(frac=0.8,random_state=200) #random state is a seed value
test=df.drop(train.index)
Answer #4:
I would use scikit-learn’s own training_test_split, and generate it from the index
from sklearn.model_selection import train_test_split
y = df.pop('output')
X = df
X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.iloc[X_train] # return dataframe train
Answer #5:
There are many ways to create a train/test and even validation samples.
Case 1: classic way train_test_split
without any options:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
Case 2: case of a very small datasets (<500 rows): in order to get results for all your lines with this cross-validation. At the end, you will have one prediction for each line of your available training set.
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(n_splits=10, random_state=0)
y_hat_all = []
for train_index, test_index in kf.split(X, y):
reg = RandomForestRegressor(n_estimators=50, random_state=0)
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
clf = reg.fit(X_train, y_train)
y_hat = clf.predict(X_test)
y_hat_all.append(y_hat)
Case 4: you need to create a train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).
from sklearn.model_selection import train_test_split
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.6)
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y, test_size=0.5)
Answer #6:
No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)
And if you want to split x from y
X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)
And if you want to split the whole df
X, y = df[list_of_x_cols], df[y_col]
Answer #7:
You can use below code to create test and train samples :
from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)
Test size can vary depending on the percentage of data you want to put in your test and train dataset.
Answer #8:
There are many valid answers. Adding one more to the bunch.
from sklearn.cross_validation import train_test_split
#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~df_model.index.isin(X_train.index)]