
Sample random row from data frame that isn't subset of previous sample in R


Newbie here. My problem has two steps: I would like to sample 3 rows from a data frame, and then take a second sample of 1 row that is not in the first sample.

# here is my data frame
df <- data.frame(matrix(rnorm(20), nrow = 10))

# here is my first sample with 3 rows
sample_1 <- df[sample(nrow(df), 3), ]

# here is my second sample
sample_2 <- df[sample(nrow(df), 1), ]

I want the second sample to not be a part of the first sample.

I appreciate your help. Thank you!

Hello! Thanks once again for the response. I have a follow-up question: if I needed to run this on a large dataset inside a for loop, so that the code runs on every iteration but selects a different group each time, would that be possible?


Solution

  • @GregorThomas' suggestion is likely best, given what we know: sample four rows at once, then use one of them as sample_2 and the remaining three as sample_1. Because sample() draws without replacement by default, the two samples cannot overlap.

    set.seed(42)
    df <- data.frame(matrix(rnorm(20), nrow=10))
    ( samples <- sample(nrow(df), size = 4) )
    # [1] 6 8 4 9
    sample_1 <- df[ samples[-1], ]
    sample_2 <- df[ samples[1],,drop = FALSE ]
    sample_1
    #            X1         X2
    # 8 -0.09465904 -2.6564554
    # 4  0.63286260 -0.2787888
    # 9  2.01842371 -2.4404669
    sample_2
    #           X1        X2
    # 6 -0.1061245 0.6359504
    

    However, if for some reason the two samples must be drawn separately, you can restrict the second draw to rows not included in the first. This is easiest when each row has a unique id of some form:

    df$id <- seq_len(nrow(df))
    df
    #             X1         X2 id
    # 1   1.37095845  1.3048697  1
    # 2  -0.56469817  2.2866454  2
    # 3   0.36312841 -1.3888607  3
    # 4   0.63286260 -0.2787888  4
    # 5   0.40426832 -0.1333213  5
    # 6  -0.10612452  0.6359504  6
    # 7   1.51152200 -0.2842529  7
    # 8  -0.09465904 -2.6564554  8
    # 9   2.01842371 -2.4404669  9
    # 10 -0.06271410  1.3201133 10
    
    sample_1 <- df[sample(nrow(df), 3), ]
    sample_1
    #           X1         X2 id
    # 6 -0.1061245  0.6359504  6
    # 2 -0.5646982  2.2866454  2
    # 5  0.4042683 -0.1333213  5
    subdf <- df[ !df$id %in% sample_1$id, ]
    sample_2 <- subdf[sample(nrow(subdf), 1), ]
    sample_2
    #         X1         X2 id
    # 7 1.511522 -0.2842529  7
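
Regarding the follow-up about running this in a for loop: the sketch below is one possible way to do it, not a definitive answer. It assumes that "a different group each time" means each iteration draws a fresh, independent pair of samples, and that only the two samples within an iteration must not overlap. The helper name `draw_disjoint`, its arguments, and the iteration count `n_iter` are hypothetical and not part of the original answer. It uses row indices with `setdiff()` instead of an id column.

```r
# Hypothetical helper: draw two non-overlapping samples of rows from `data`.
draw_disjoint <- function(data, n_first = 3, n_second = 1) {
  first_idx <- sample(nrow(data), n_first)
  # Candidate rows for the second sample: everything not already drawn.
  remaining <- setdiff(seq_len(nrow(data)), first_idx)
  # Index into `remaining` rather than calling sample(remaining, ...),
  # because sample(x, ...) treats a single number x as 1:x.
  second_idx <- remaining[sample.int(length(remaining), n_second)]
  list(sample_1 = data[first_idx, , drop = FALSE],
       sample_2 = data[second_idx, , drop = FALSE])
}

set.seed(42)
df <- data.frame(matrix(rnorm(20), nrow = 10))

n_iter <- 5  # assumed number of iterations
results <- vector("list", n_iter)
for (i in seq_len(n_iter)) {
  # Each iteration gets its own independent pair of samples.
  results[[i]] <- draw_disjoint(df)
}
```

Each element of `results` is a list holding that iteration's `sample_1` and `sample_2`, for example `results[[2]]$sample_2`. If the groups also need to be distinct across iterations, the rows already used would have to be tracked and excluded between iterations, which this sketch does not do.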