
R. Sample rows following conditions (at random within a range of values; fixed within another range of values)


I have a dataframe A like this one:

SNP X Y
rs1 5 aa
rs2 1 bb
rs3 6 aa
rs4 2 bb
rs7 11 ft
rs8 3 hg
rs9 1.2 ff
rs10 2.2 cc
rs11 2.2 yh
rs362 3.2 hyu

Using R, I want to select rows according to 2 conditions: (1) keep all rows where X >= 5; (2) sample 2 rows at random, without replacement, from those where X > 0 and X < 5. I would get something like this:

SNP X Y
rs1 5 aa
rs2 1 bb
rs3 6 aa
rs7 11 ft
rs9 1.2 ff

I am trying something like:

A.1 = A[A$X >= 5,]
B.2 = A[sample(nrow(A), 2), ]

Solution

  • We can use the which function:

    set.seed(1) # reproducible
    d[c(which(d$X >= 5), sample(which(d$X > 0 & d$X < 5), 2)),]
    
      SNP    X  Y
    1 rs1  5.0 aa
    3 rs3  6.0 aa
    5 rs7 11.0 ft
    2 rs2  1.0 bb
    7 rs9  1.2 ff
    

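    which(d$X >= 5) returns the indices of the rows where X >= 5. A second which call returns the indices of the rows where X > 0 & X < 5, and sample draws 2 of those indices without replacement. c() concatenates the two index vectors, and the result is used to subset d (the question's data frame A), so the kept rows come first, followed by the sampled rows.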

    data

    d <- structure(list(SNP = c("rs1", "rs2", "rs3", "rs4", "rs7", "rs8", 
                                "rs9", "rs10", "rs11", "rs362"), 
                        X = c(5, 1, 6, 2, 11, 3, 1.2, 
                              2.2, 2.2, 3.2),
                        Y = c("aa", "bb", "aa", "bb", "ft", "hg", "ff", 
                              "cc", "yh", "hyu")), 
                   class = "data.frame", 
                   row.names = c(NA, -10L))
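
    As a rough sketch, the same indexing idea can be wrapped in a small helper. The function name sample_rows and its arguments threshold and n are hypothetical, not part of the original answer. It also guards against one known quirk of sample(): when given a single number x, it samples from 1:x, which would return the wrong rows if only one candidate were left. Sorting the indices is an optional choice that keeps the result in the original row order, as in the question's expected output.

```r
# Hypothetical helper (not from the original answer): keep every row with
# X >= threshold and draw n rows at random from those with 0 < X < threshold.
sample_rows <- function(df, threshold = 5, n = 2) {
  keep <- which(df$X >= threshold)
  pool <- which(df$X > 0 & df$X < threshold)
  if (length(pool) < n) {
    stop("fewer than n rows satisfy 0 < X < threshold")
  }
  # Sample positions within pool rather than pool itself, because
  # sample(x, n) treats a single number x as the range 1:x.
  picked <- pool[sample.int(length(pool), n)]
  # sort() restores the original row order (an assumption about what is wanted).
  df[sort(c(keep, picked)), ]
}

# Same data as in the question.
d <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4", "rs7", "rs8",
          "rs9", "rs10", "rs11", "rs362"),
  X = c(5, 1, 6, 2, 11, 3, 1.2, 2.2, 2.2, 3.2),
  Y = c("aa", "bb", "aa", "bb", "ft", "hg", "ff", "cc", "yh", "hyu")
)

set.seed(1) # reproducible
res <- sample_rows(d)
print(res)
```

    Calling sample_rows(d, threshold = 5, n = 3) would draw 3 rows from the lower range instead of 2.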