Splitting a data frame into test and train data sets


Use pandas to create two data frames: train_df and test_df, where train_df has 80% of the data chosen uniformly at random without replacement.

Here, what does "data chosen uniformly at random without replacement" mean?

Also, how can I do it?

Thanks


Solution

  • "chosen uniformly at random" means that each row has an equal probability of being selected into the 80%

    "without replacement" means that each row is only considered once. Once it is assigned to the training or test set, it is not considered again.

    For example, consider the data below:

    A            B
    
    0            5
    1            6
    2            7
    3            8
    4            9
    

    If this dataset is split into an 80% training set and a 20% test set, we end up with a training set of 4 rows (80% of the data) and a test set of 1 row (20% of the data).

    Without replacement: assume the first row is assigned to the training set. Now the training set is:

    A            B
    
    0            5
    

    When the next row is assigned to training or test, it will be selected from the remaining rows:

    A            B
    
    1            6
    2            7
    3            8
    4            9
    

    With replacement: assume the first row is assigned to the training set. Now the training set is:

    A            B
    
    0            5
    

    But the next row will be selected from the entire dataset (i.e. the first row has been placed back in the original dataset):

    A            B
    
    0            5
    1            6
    2            7
    3            8
    4            9
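
The difference between the two modes can be seen with a small sketch, assuming the five-row table above is loaded into a pandas data frame. The `replace` argument of `DataFrame.sample` switches between them; the `random_state` value is an arbitrary seed used here only to make the run repeatable.

```python
import pandas as pd

# The five-row example table from above.
df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]})

# Without replacement: once a row is drawn it cannot be drawn again.
without = df.sample(n=4, replace=False, random_state=0)

# With replacement: every draw sees the full table, so rows may repeat.
with_repl = df.sample(n=4, replace=True, random_state=0)

print(without.index.is_unique)   # always True when replace=False
print(with_repl.index.tolist())  # may contain the same row label more than once
```

For a train/test split you want the first behaviour, since a repeated row in the training set would add no new information.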
    

    How you can do this: you can use the train_test_split function from scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
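
A minimal sketch of the scikit-learn route, assuming scikit-learn is installed and using the five-row example table as stand-in data. The names `train_df` and `test_df` come from the question; the `random_state` value is an arbitrary seed, not something the question requires.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data: the five-row example table.
df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]})

# test_size=0.2 leaves 80% of the rows for training. Rows are shuffled
# by default and each row ends up in exactly one of the two frames.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))
```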

    Or you could do this using pandas and NumPy (this gives an approximately 80/20 split, because each row is assigned independently):

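    import numpy as np

    # rand draws uniformly from [0, 1); randn would draw from a normal distribution instead
    df['random_number'] = np.random.rand(len(df))

    # Each row goes to train with probability 0.8
    train = df[df['random_number'] <= 0.8]
    test = df[df['random_number'] > 0.8]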
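
If the training set should hold exactly 80% of the rows, one pandas-only alternative is to sample the training rows and drop them from the original frame to get the test rows. This is a sketch under two assumptions: the data frame has a unique index (otherwise `drop` would remove more rows than intended), and the `random_state` seed is only there for repeatability.

```python
import pandas as pd

# Stand-in data: the five-row example table.
df = pd.DataFrame({"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]})

# sample() draws without replacement by default, with every row equally likely.
train_df = df.sample(frac=0.8, random_state=0)

# Whatever was not drawn for training becomes the test set.
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))
```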