
Clarification on using the sample() function in R to set up training and test sets for ML


I am trying to understand this example of the KNN algorithm in R by Datacamp: Machine Learning in R for beginners

I am having trouble understanding how they execute sampling to set up the training and test data sets.

I am able to follow the code till this line:

ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))

It is my understanding that this creates a vector of length nrow(iris) whose values are either 1 or 2, drawn with probabilities 0.67 and 0.33, respectively.

Thus, we get the following output:

> ind
  [1] 1 1 2 1 2 2 1 2 1 1 1 1 2 2 1 1 1 1 2 2 1 1 1 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
 [58] 1 1 1 1 2 1 1 1 2 1 2 1 1 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 2 1 2 2 1 2 1 1 1
[115] 1 2 1 1 1 2 1 2 1 1 2 1 1 2 1 2 1 2 1 1 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 
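
That reading can be checked with a short sketch. The seed value is an arbitrary assumption added for reproducibility and is not taken from this post; since the draw is random, the observed shares will only be close to 0.67 and 0.33, not exact.

```r
set.seed(1234)  # arbitrary seed (assumption, for reproducibility only)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

length(ind) == nrow(iris)   # one label per row of iris
all(ind %in% 1:2)           # every label is either 1 or 2
prop.table(table(ind))      # observed shares of 1s and 2s
```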

In the next step, they create the training set using the following code:

iris.training <- iris [ind==1, 1:4]

This line presumably produces a data frame consisting of columns 1 to 4 of all the rows for which ind == 1.

head(iris.training)
   Sepal.Length Sepal.Width Petal.Length Petal.Width
1           5.1         3.5          1.4         0.2
2           4.9         3.0          1.4         0.2
4           4.6         3.1          1.5         0.2
7           4.6         3.4          1.4         0.3
9           4.4         2.9          1.4         0.2
10          4.9         3.1          1.5         0.1

My question is how are the variable ind and the iris data set related. That is, how does R know which rows to pick (those for which ind == 1) from the original iris data set, when there seems to be no connection between ind and iris? The only time the iris data set is mentioned when setting up ind is to determine the sample size (the number of values to draw) via nrow(iris) in ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33)).


Solution

  • My question is how are the variable ind and the iris data set related.

    They're not, but they needn't be. For example, there is no intrinsic relationship between the numbers 1-5 and the iris dataset, yet they can still be used to index its rows:

    iris[1:5, ]
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    #> 1          5.1         3.5          1.4         0.2  setosa
    #> 2          4.9         3.0          1.4         0.2  setosa
    #> 3          4.7         3.2          1.3         0.2  setosa
    #> 4          4.6         3.1          1.5         0.2  setosa
    #> 5          5.0         3.6          1.4         0.2  setosa
    

    Created on 2018-08-05 by the reprex package (v0.2.0).
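
To see that iris[ind == 1, 1:4] works the same way as iris[1:5, ], here is a minimal sketch that converts the logical vector into explicit row positions and compares the two subsets. The seed and the names keep, rows, by_logical and by_position are illustrative assumptions, not part of the tutorial.

```r
set.seed(1234)  # arbitrary seed (assumption)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))

keep <- ind == 1        # logical vector: one TRUE/FALSE per row position
rows <- which(keep)     # integer positions at which keep is TRUE

by_logical  <- iris[keep, 1:4]   # subset with the logical vector
by_position <- iris[rows, 1:4]   # subset with the explicit row numbers

identical(by_logical, by_position)  # compares the two data frames
nrow(by_logical) == sum(keep)       # one output row per TRUE
```

Because the logical vector is matched to rows by position, it does not need to be derived from iris at all; it only needs to have nrow(iris) elements.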

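    The link is purely positional: ind has one element per row of iris, and iris[ind == 1, ] keeps row i exactly when ind[i] is 1. It would be a bit clearer to say ind <- sample(c(TRUE, FALSE), nrow(iris), replace=TRUE, prob=c(0.67, 0.33)) and then iris[ind, ] to emphasize that ind is an index of rows to select rather than the result of a comparison of variables within iris.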
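
As a sketch of that suggestion, the logical index can be negated to obtain the complementary test set. The names in_train, iris_train and iris_test and the seed are hypothetical choices made for illustration.

```r
set.seed(1234)  # arbitrary seed (assumption)

# TRUE marks a row for training, FALSE for testing
in_train <- sample(c(TRUE, FALSE), nrow(iris), replace = TRUE,
                   prob = c(0.67, 0.33))

iris_train <- iris[in_train, ]    # rows where in_train is TRUE
iris_test  <- iris[!in_train, ]   # the remaining rows

# every row lands in exactly one of the two sets
nrow(iris_train) + nrow(iris_test) == nrow(iris)
```

Using a single logical vector and its negation guarantees that the two sets do not overlap and together cover the whole data set.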