Tags: r, dataframe, tidyverse, probability

How to create a variable with value based on probabilities in another data frame in R?


I have created a table called data. This table contains a non-unique ID field.

data <- data.frame(ID = sample(1:5, 10, replace = TRUE))

I have another table called probabilities, which contains matches for the ID field, corresponding ratios and names:

probabilities <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5),
  ratio = c(0.9, 0.1, 0.4, 0.6, 0.8, 0.2, 0.3, 0.3, 0.4, 1.0),
  name  = c("A", "B", "A", "C", "F", "G", "B", "C", "G", "F")
)

I am trying to create a new variable called name in the data table. For each row, its value should be drawn at random from the name values in the probabilities table that share the row's ID, using the ratio column as the sampling probabilities.

For example, any ID of 1 in the data table should have a 90% chance of being A and a 10% chance of being B. An ID of 4 should have a 30% chance of being B, a 30% chance of being C and a 40% chance of being G, and so on.

Does anyone know how this can be achieved?

I have tried the below but am getting an error:

#load packages
library(dplyr)


#create new variable called name
data <- data %>% 
  mutate(name = sample(probabilities$name[ID=probabilities$ID],
                       size = n(),
                       prop = probabilities$ratio[ID=probabilities$ID],
                       replace = TRUE))

Error in mutate():
! Problem while computing name = sample(...).
Caused by error in sample():
! unused argument (prop = probabilities$ratio[name = probabilities$name])


Solution

  • base R solution, using sapply() and sample():

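    # For each ID in data, draw one name from the matching rows of
    # probabilities, weighted by their ratio values.
    data$name <- sapply(data$ID, function(ID) {
      rows <- probabilities$ID == ID
      sample(x = probabilities[rows, "name"],
             prob = probabilities[rows, "ratio"],
             size = 1)
    })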
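
  • dplyr variant, as a sketch rather than a definitive answer. The attempt in the question fails for two reasons: sample() takes an argument named prob, not prop, and ID=probabilities$ID inside [ ] is not a comparison (that needs ==). The version below groups data by ID and draws all names for a group in one call. The helper draw_names() and the fixed seed are assumptions added for illustration; they are not part of the original question or answer.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed, only so the sketch is reproducible

data <- data.frame(ID = sample(1:5, 10, replace = TRUE))
probabilities <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5),
  ratio = c(0.9, 0.1, 0.4, 0.6, 0.8, 0.2, 0.3, 0.3, 0.4, 1.0),
  name  = c("A", "B", "A", "C", "F", "G", "B", "C", "G", "F"),
  stringsAsFactors = FALSE
)

# draw_names() is a hypothetical helper: it returns n names drawn from the
# rows of `probabilities` whose ID equals `id`, weighted by `ratio`.
draw_names <- function(id, n) {
  rows <- probabilities[probabilities$ID == id, ]
  # Sample row indices rather than values, so a group with a single
  # candidate row still behaves as expected.
  idx <- sample.int(nrow(rows), size = n, replace = TRUE, prob = rows$ratio)
  rows$name[idx]
}

data <- data %>%
  group_by(ID) %>%
  # Within a group every ID is identical, so ID[1] identifies the group;
  # n() is the number of rows in the current group.
  mutate(name = draw_names(ID[1], n())) %>%
  ungroup()
```

    Drawing once per group avoids filtering probabilities for every row, which may matter when data is large. The result has the same meaning as the sapply() version: each row gets an independent draw from its ID's distribution.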