I have a question regarding a command. In class we used runif to create a training set that should contain 50% of the data set (we then developed a decision tree based on this training set). But I still can't understand the logic behind this command; could someone explain to me how it works?
I understand decision trees and the logic behind splitting up a data set; my question is specifically about how this command works.
inTrain <- runif(nrow(USArrests)) < 0.5
You have a dataset named USArrests with nrow(USArrests) rows, let's say 100 for the sake of simplicity. So runif(nrow(USArrests)) creates 100 uniformly distributed random numbers between 0 and 1, i.e. one number for every row in your dataset.
Next, your expression runif(nrow(USArrests)) < 0.5 checks whether each number is below 0.5, returning TRUE or FALSE. This gives you a logical vector of length 100 (or nrow(USArrests)) that indicates whether a row belongs to the training or to the test dataset. Because each row is assigned to the training set with probability 0.5, the split is only approximately 50%, not exactly.
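To make the intermediate steps visible, here is a small sketch that breaks the one-liner apart. The seed value and the variable names n and u are assumptions added for illustration; they are not part of the original command.

```r
set.seed(42)            # assumed seed, only so the run is repeatable

n <- nrow(USArrests)    # number of rows in the built-in dataset
u <- runif(n)           # n random draws, uniform between 0 and 1
inTrain <- u < 0.5      # element-wise comparison: TRUE where the draw is below 0.5

head(u)                 # first few random numbers
head(inTrain)           # the matching TRUE/FALSE flags
sum(inTrain)            # rows flagged for training: about n / 2, not exactly
```

Each element of inTrain depends only on its own random draw, which is why the number of TRUE values varies from run to run unless a seed is set.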
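It's not shown, but finally you select your training data by
USArrests[inTrain,]
and your test data by
USArrests[!inTrain,]
Note the logical negation !inTrain: writing -inTrain would coerce the logical vector to the numbers -1 and 0, which does not select the complement.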
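As a quick check of the split, the sketch below builds both subsets, using the logical negation !inTrain for the test rows, and confirms that every row ends up in exactly one of them. The names train and test and the seed are assumptions for illustration.

```r
set.seed(1)                                 # assumed seed, only for repeatability
inTrain <- runif(nrow(USArrests)) < 0.5

train <- USArrests[inTrain, ]               # rows where the flag is TRUE
test  <- USArrests[!inTrain, ]              # rows where the flag is FALSE

# The two subsets together cover every row ...
nrow(train) + nrow(test) == nrow(USArrests)            # TRUE
# ... and share no row (row names are the state names)
length(intersect(rownames(train), rownames(test)))     # 0
```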