I am using R and would like to randomly assign a status to rows, with a probability that depends on a particular column value.
My table (called df) looks like this:
Year Level Country
2018 1 Eng
2018 2 Wal
2018 1 Eng
2019 3 Sco
etc... There are three possible levels (1,2,3), all years between 2011 and 2022, and numerous countries. Rows can be duplicated, and there are thousands of rows.
I want to add a new column, with a value of TRUE or FALSE. I have a separate table (called Table1) telling me the number of TRUE values there should be for each year:
2018 2019 2020 2021 2022
123 100 99 189 211
However, only rows with certain conditions can get a TRUE value - they have to be from 2018 onwards, they have to have Eng as their country, and if the year is 2018 then they have 2 as their Level (other years the Level doesn't matter).
I would like the original table to have this additional column for all rows, where the number of TRUES in each year match those in Table1.
I've tried various things. I started trying to combine conditions with "&" together with a sample() function.
newtable <- df$Country == "Eng" &
df$Level = 2 &
df$Year >= 2018 &
sample(c(T,F),nrows(df),replace=T,prob=c(???))
but then got stuck on how the probabilities could differ between years, as well as how to convert the raw numbers in Table1 into probabilities.
I'm also stuck on how to apply a slightly different filter to 2018 (Level 2 only) - I can use this dplyr:filter `
filter(!(Year==2018 and Level!=2))
But I'm not sure how to incorporate this into my above function. `
Any help much appreciated.
To indicate which rows are eligible to be assigned TRUE
, you can use the following expression
df$eligible <- (df$Year > 2018 & df$Country == "Eng") |
(df$Year == 2018 & df$Country == "Eng" & df$Level == 2)
which says a row is eligible if Year
is strictly greater than 2018 (i.e. 2019 or greater) and Country
is "Eng", OR Year
is 2018 and Country
is "Eng" and Level
is 2.
Now, allocate TRUE
s among eligible rows. There are several ways to do this. One way is to split eligible rows by Year
and allocate the TRUE
s according to the values in the table.
eligible <- df[df$eligible, ]
trues <- c(123, 100, 99, 189, 211)
Now, using mapply
:
new_df <- mapply(FUN = function(x, n) {
x$new_column <- FALSE
x$new_column[sample(1:nrow(x), n)] <- TRUE
x
}, x = split(eligible, eligible$Year), n = trues, SIMPLIFY = FALSE)
we create a function that takes two arguments: a dataframe of eligible rows, and the number of TRUE
s for that dataframe, and returns the dataframe with a new column with randomly allocated TRUE
s. mapply
then applies the values of split(eligible, eligible$Year)
(a list with five dataframes, each corresponding to a year between 2018 and 2022) and trues
(a vector with five elements) in parallel to the function.
Finishing off,
new_df <- do.call(rbind, new_df)
new_df <- rbind(new_df, cbind(df[!df$eligible, ], new_column = FALSE))
we row-bind the output of mapply(...)
(a list with five dataframes) together, and then rbind
that to the non-eligible rows with the new column appended where the values are, by definition, FALSE
.