Search code examples
rdplyrduplicatesdata-transform

r randomly assign 1 or 0 based on conditons


For a dataset like this

    MainID    SubID     DOB            BMI
    1234      1234_A    Feb-19-2024    10.1

    1235      1235_A    Jan-11-2023    17.23
    1235      1235_B    Jan-11-2023    19.11

    5136      5136_A    May-17-2021    21.87
    5136      5136_B    May-17-2021    14.18
    5136      5136_C    May-17-2021    18.11

    3357      3357-A    Oct-06-2023    24.10   

    9124      9124-B    July-01-2021   12.09
    9124      9124-B    July-01-2021   15.06

I am trying to randomly assign a value 0 or 1 only if the values in MainID and DOB are same SubID is different. Expecting a dataset like this

    MainID    SubID     DOB             BMI   Col1
    1234      1234_A    Feb-19-2024     10.1  0

    1235      1235_A    Jan-11-2023     17.23 0
    1235      1235_B    Jan-11-2023     19.11 1

    3357      3357-A    Oct-06-2023     24.10 0

    5136      5136_A    May-17-2021     21.87 0  
    5136      5136_B    May-17-2021     14.18 0
    5136      5136_C    May-17-2021     18.11 1

    9124      9124-B    July-01-2021    12.09 0
    9124      9124-B    July-01-2021    15.06 0

Here only rows with ID 1235 and 5136 are assigned 0 or 1 because the repeated rows have same MainID, DOB and different SubID.

I tried options with ifelse and duplicated(df[c("MainID", "DOB")]) but this did not work. Any suggestion is much appreciated. Thanks in advance.


Solution

  • It's not clear if you want 0:1 sampled with replacement. I'm assuming always 0:1 and with-replacement only when the number of rows in a group is more than 2.

    Up front, both dplyr and base-R code have a snippet similar to this.

    sample(c(0:1, sample(0:1, size = n()-2, replace = (n() > 4))))
    
    • we need to make sure that we have at least one 0 and one 1 in the bunch; if we did sample(0:1, size=n(), replace=TRUE), it is feasible (though improbable) that we can have all 0s or all 1s, and I'm inferring that we want/need at least one of each;
    • because of that we force at least one of each with the c(0:1, ..), forcing the presence of each number;
    • and since we force those two, we need the inner sample to produce a size of n() - 2 (2 fewer than there are rows); if there are only four rows, then the outer 0:1 and the inner 0:1 will suffice, so we don't technically need replacement, ergo replace=(n() > 4)
    • now that we have a vector that is guaranteed to start with 0:1 and may have zero or more 0s and 1s based on the number of rows in the group, we need to randomly reorder those (because we don't want the first two to be always 0:1

    dplyr

    library(dplyr)
    set.seed(42)
    quux |>
      mutate(
        Col1 = if (n_distinct(SubID) > 1) {
            sample(c(0:1, sample(0:1, size = n()-2, replace = (n() > 4))))
          } else 0,
        .by = c(MainID, DOB)
      )
    #   MainID  SubID          DOB   BMI Col1
    # 1   1234 1234_A  Feb-19-2024 10.10    0
    # 2   1235 1235_A  Jan-11-2023 17.23    0
    # 3   1235 1235_B  Jan-11-2023 19.11    1
    # 4   5136 5136_A  May-17-2021 21.87    0
    # 5   5136 5136_B  May-17-2021 14.18    1
    # 6   5136 5136_C  May-17-2021 18.11    0
    # 7   3357 3357-A  Oct-06-2023 24.10    0
    # 8   9124 9124-B July-01-2021 12.09    0
    # 9   9124 9124-B July-01-2021 15.06    0
    

    base R

    ave(seq_len(nrow(quux)), quux[, c("MainID", "DOB")],
        FUN = function(ind) {
          nr <- length(ind)
          nsub <- length(unique(quux$SubID[ind]))
          if (nsub > 1) sample(c(0:1, sample(0:1, size = nr-2, replace = (nr>4)))) else 0
        })
    # [1] 0 0 1 0 1 0 0 0 0
    

    Data

    quux <- structure(list(MainID = c(1234L, 1235L, 1235L, 5136L, 5136L, 5136L, 3357L, 9124L, 9124L), SubID = c("1234_A", "1235_A", "1235_B", "5136_A", "5136_B", "5136_C", "3357-A", "9124-B", "9124-B"), DOB = c("Feb-19-2024", "Jan-11-2023", "Jan-11-2023", "May-17-2021", "May-17-2021", "May-17-2021", "Oct-06-2023", "July-01-2021", "July-01-2021"), BMI = c(10.1, 17.23, 19.11, 21.87, 14.18, 18.11, 24.1, 12.09, 15.06)), class = "data.frame", row.names = c(NA, -9L))