Search code examples
rprobabilitysample

Sample function with probabilities dependent on another column


I am using R and would like to randomly assign a status to rows, with a probability that depends on a particular column value.

My table (called df) looks like this:

Year    Level  Country    
2018    1      Eng
2018    2      Wal
2018    1      Eng
2019    3      Sco

etc... There are three possible levels (1,2,3), all years between 2011 and 2022, and numerous countries. Rows can be duplicated, and there are thousands of rows.

I want to add a new column, with a value of TRUE or FALSE. I have a separate table (called Table1) telling me the number of TRUE values there should be for each year:

2018  2019  2020  2021  2022
 123   100    99   189   211

However, only rows with certain conditions can get a TRUE value - they have to be from 2018 onwards, they have to have Eng as their country, and if the year is 2018 then they have 2 as their Level (other years the Level doesn't matter).

I would like the original table to have this additional column for all rows, where the number of TRUES in each year match those in Table1.

I've tried various things. I started trying to combine conditions with "&" together with a sample() function.

newtable <- df$Country == "Eng" &
            df$Level = 2 & 
            df$Year >= 2018 &
            sample(c(T,F),nrows(df),replace=T,prob=c(???))

but then got stuck on how the probabilities could differ between years, as well as how to convert the raw numbers in Table1 into probabilities.

I'm also stuck on how to apply a slightly different filter to 2018 (Level 2 only) - I can use this dplyr:filter `

filter(!(Year==2018 and Level!=2))

But I'm not sure how to incorporate this into my above function. `

Any help much appreciated.


Solution

  • To indicate which rows are eligible to be assigned TRUE, you can use the following expression

    df$eligible <- (df$Year > 2018 & df$Country == "Eng") | 
        (df$Year == 2018 & df$Country == "Eng" & df$Level == 2)
    

    which says a row is eligible if Year is strictly greater than 2018 (i.e. 2019 or greater) and Country is "Eng", OR Year is 2018 and Country is "Eng" and Level is 2.

    Now, allocate TRUEs among eligible rows. There are several ways to do this. One way is to split eligible rows by Year and allocate the TRUEs according to the values in the table.

    eligible <- df[df$eligible, ]
    trues <- c(123, 100, 99, 189, 211)
    

    Now, using mapply:

    new_df <- mapply(FUN = function(x, n) {
      x$new_column <- FALSE
      x$new_column[sample(1:nrow(x), n)] <- TRUE
      x
    }, x = split(eligible, eligible$Year), n = trues, SIMPLIFY = FALSE)
    

    we create a function that takes two arguments: a dataframe of eligible rows, and the number of TRUEs for that dataframe, and returns the dataframe with a new column with randomly allocated TRUEs. mapply then applies the values of split(eligible, eligible$Year) (a list with five dataframes, each corresponding to a year between 2018 and 2022) and trues (a vector with five elements) in parallel to the function.

    Finishing off,

    new_df <- do.call(rbind, new_df)
    new_df <- rbind(new_df, cbind(df[!df$eligible, ], new_column = FALSE))
    

    we row-bind the output of mapply(...) (a list with five dataframes) together, and then rbind that to the non-eligible rows with the new column appended where the values are, by definition, FALSE.