Tags: r, machine-learning, decision-tree

Split a data set into a training and test set with runif


I have a question about a command. In class we used runif to create a training set that should contain 50% of the data set (we then built a decision tree on this training set). I still can't understand the logic behind this command. Could someone explain how it works?

I understand decision trees and the logic behind splitting up a data set; my question is specifically about how this command works.

inTrain <- runif(nrow(USArrests)) < 0.5

Solution

  • You have a dataset named USArrests with nrow(USArrests) rows; for the sake of simplicity, say 100. runif(nrow(USArrests)) then generates 100 uniformly distributed random numbers between 0 and 1 (the defaults for runif), i.e. one number for every row in your dataset.

    Next, the expression runif(nrow(USArrests)) < 0.5 checks whether each number is below 0.5 and returns TRUE or FALSE accordingly. This gives you a logical vector of length 100 (or nrow(USArrests)) that indicates whether each row belongs to the training set (TRUE) or the test set (FALSE). Because each number falls below 0.5 with probability 0.5, about half of the rows end up in the training set, though usually not exactly half.

    It's not shown, but finally you select your training data with

    USArrests[inTrain,]
    

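    and your test data with

    USArrests[!inTrain,]

    Note the ! (logical negation) rather than -: applying - to a logical vector coerces TRUE and FALSE to -1 and 0, so the result would not be the rows where inTrain is FALSE.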
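
Putting the steps together, here is a minimal sketch of the split. The seed value and the variable names u, train and test are assumptions added for illustration; they are not part of the original command. The test set is taken with !inTrain, the logical negation of the training mask.

```r
# Fix the seed so the random split is reproducible (the value is arbitrary).
set.seed(42)

# One uniform random number between 0 and 1 per row of the data set.
u <- runif(nrow(USArrests))

# TRUE where the number is below 0.5, so each row lands in the
# training set with probability 0.5, independently of the other rows.
inTrain <- u < 0.5

train <- USArrests[inTrain, ]   # rows where inTrain is TRUE
test  <- USArrests[!inTrain, ]  # rows where inTrain is FALSE

# Every row ends up in exactly one of the two sets,
# but the split is only approximately 50/50.
print(table(inTrain))
print(nrow(train) + nrow(test) == nrow(USArrests))
```

If an exact 50% split is needed, one alternative (a different technique from the one asked about) is to draw row indices with sample(nrow(USArrests), nrow(USArrests) %/% 2). With integer indices, negative indexing such as USArrests[-idx, ] is the correct way to get the remaining rows.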