Splitting a data frame into test and train data sets

Use pandas to create two data frames: train_df and test_df, where train_df has 80% of the data chosen uniformly at random without replacement.

Here, what does "data chosen uniformly at random without replacement" mean?

Also, how can I do it?

Thanks

Solution

"chosen uniformly at random" means that each row has an equal probability of being selected into the 80%.

"without replacement" means that each row is only considered once. Once it is assigned to a training or test set, it is not considered again.

For example, consider the data below:

A            B

0            5
1            6
2            7
3            8
4            9

If this dataset is split into an 80% training set and a 20% test set, we end up with a training set of 4 rows (80% of the data) and a test set of 1 row (20% of the data).

Without Replacement

Assume the first row is assigned to the training set. Now the training set is:

A            B

0            5

When the next row is assigned to training or test, it will be selected from the remaining rows:

A            B

1            6
2            7
3            8
4            9

With Replacement

Assume the first row is assigned to the training set. Now the training set is:

A            B

0            5

But the next row will be assigned using the entire dataset (i.e. the first row has been placed back in the original dataset):

A            B

0            5
1            6
2            7
3            8
4            9
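The difference between the two schemes can be sketched with pandas' DataFrame.sample, which takes a replace argument. This is only an illustration: it assumes A and B are ordinary columns of the example data, and the random_state value is arbitrary.

```python
import pandas as pd

# Example data from above (assumed here to be two ordinary columns, A and B)
df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]})

# Without replacement: a row that has been drawn cannot be drawn again,
# so the 4 sampled rows are always distinct.
without_repl = df.sample(n=4, replace=False, random_state=0)

# With replacement: each draw is made from the full dataset,
# so the same row may appear more than once.
with_repl = df.sample(n=4, replace=True, random_state=0)

print(without_repl)
print(with_repl)
```

A train/test split needs the "without replacement" behaviour, otherwise the same row could be counted twice.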

How you can do this

You can use the train_test_split function from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
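A minimal sketch of the scikit-learn route, assuming the example data above is held in a DataFrame named df. The random_state value is arbitrary and only there to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Example data (assumed to be two ordinary columns, A and B)
df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]})

# test_size=0.2 holds out 20% of the rows; rows are shuffled by default
# and each row ends up in exactly one of the two sets.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))
```

train_test_split returns DataFrames when given a DataFrame, so train_df and test_df keep the original column names and index labels.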

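Or you could do this using pandas and NumPy:

import numpy as np

# One number per row, drawn uniformly from [0, 1).
# Note: np.random.rand is uniform; np.random.randn is standard normal and would not give an 80/20 split.
df['random_number'] = np.random.rand(len(df))

# Each row lands in train with probability 0.8, so the split is about 80/20, not exactly
train = df[df['random_number'] <= 0.8]
test = df[df['random_number'] > 0.8]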
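If an exact 80/20 split is needed using pandas alone, one option is DataFrame.sample, which samples without replacement by default. The sketch below uses made-up data and assumes the index labels of df are unique, since the test set is built by dropping the sampled labels.

```python
import pandas as pd

# Made-up data for illustration: 10 rows with a default (unique) integer index
df = pd.DataFrame({"A": range(10), "B": range(10, 20)})

# Draw 80% of the rows uniformly at random, without replacement (the default)
train_df = df.sample(frac=0.8, random_state=42)

# Everything that was not sampled becomes the test set
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))
```

Because test_df is defined as the rows not in train_df, the two sets cannot overlap and together cover the whole data frame.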