Search code examples
rfor-loopdplyrgroup-byplyr

Draw rows based on probabilities in R


I have two datasets. I want to assign for every row of second dataset a "Group" from first dataset based on "Probabilities" from first dataset. Logic what I want to do is as follows:

  1. Group 1st and 2nd dataset by period & socio demo.
  2. For every socio-demo Probability from 1st dataset draw randomly number of rows from 2nd dataset having the same period and socio
  3. Assign a "Group" (there are 4 groups) accordingly

my datasets:

Period Group socio_demo Probabilities
01.2021-03.2021 1 20yo_West_old_3child 60
01.2021-03.2021 2 20yo_West_old_3child 20
01.2021-03.2021 3 20yo_West_old_3child 10
01.2021-03.2021 4 20yo_West_old_3child 10
01.2021-03.2021 1 20yo_West_old_2child 50
01.2021-03.2021 2 20yo_West_old_2child 20
01.2021-03.2021 3 20yo_West_old_2child 25
01.2021-03.2021 4 20yo_West_old_2child 5
Period ID socio_demo Weight
01.2021-03.2021 1 20yo_West_old_3child 0.1
01.2021-03.2021 2 20yo_West_old_3child 0.11
01.2021-03.2021 3 20yo_West_old_3child 0.69
01.2021-03.2021 4 20yo_West_old_3child 0.14
01.2021-03.2021 5 20yo_West_old_3child 0.98
01.2021-03.2021 6 20yo_West_old_3child 0.89
01.2021-03.2021 7 20yo_West_old_3child 0.45
01.2021-03.2021 8 20yo_West_old_3child 0.1
01.2021-03.2021 9 20yo_West_old_3child 0.4
01.2021-03.2021 10 20yo_West_old_3child 0.3

I want to obtain:

Period ID socio_demo Weight Group
01.2021-03.2021 1 20yo_West_old_3child 0.1 1
01.2021-03.2021 2 20yo_West_old_3child 0.11 1
01.2021-03.2021 3 20yo_West_old_3child 0.69 2
01.2021-03.2021 4 20yo_West_old_3child 0.14 1
01.2021-03.2021 5 20yo_West_old_3child 0.98 4
01.2021-03.2021 6 20yo_West_old_3child 0.89 1
01.2021-03.2021 7 20yo_West_old_3child 0.45 3
01.2021-03.2021 8 20yo_West_old_3child 0.1 1
01.2021-03.2021 9 20yo_West_old_3child 0.4 2
01.2021-03.2021 10 20yo_West_old_3child 0.3 1

I tried:

df2 %>%
  left_join(df1) %>%
  tidyr::uncount(Probabilities) %>%
  group_by(Period, ID) %>%
  slice_sample(n = 1) %>%
  ungroup()

But unfortunately this is giving me error: *Long vectors are not yet supported. Requested output must be less than 2147483647. *

Thank you for helping


Solution

  • A sample of R code to carry out the task you described is provided here:

    library(dplyr)
    
    df1 <- data.frame(Period = c("01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021"),
                      Group = c(1,2,3,4,1,2,3,4),
                      socio_demo = c("20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_2child", "20yo_West_old_2child", "20yo_West_old_2child", "20yo_West_old_2child"),
                      Probabilities = c(60,20,10,10,50,20,25,5))
    
    df2 <- data.frame(Period = c("01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021"),
                      ID = c(1,2,3,4,5,6,7,8,9,10),
                      socio_demo = c("20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child"),
                      Weight = c(0.1,0.11,0.69,0.14,0.98,0.89,0.45,0.1,0.4,0.3))
    
    result_df <- df2 %>%
      group_by(Period, socio_demo) %>%
      mutate(Group = sample(df1$Group, n(), replace = TRUE, prob = df1$Probabilities/sum(df1$Probabilities)))
    

    Before defining the two data frames df1 and df2, we loaded the dplyr library in the preceding code. The operations you described were then carried out with the help of the dplyr library's group_by and mutate functions: grouping the df2 data frame by socio_demo and Period, and then randomly assigning a Group to each row of df2 based on the probabilities of each Group in df1.

    # Groups:   Period, socio_demo [1]
       Period             ID socio_demo           Weight Group
       <chr>           <dbl> <chr>                 <dbl> <dbl>
     1 01.2021-03.2021     1 20yo_West_old_3child   0.1      3
     2 01.2021-03.2021     2 20yo_West_old_3child   0.11     1
     3 01.2021-03.2021     3 20yo_West_old_3child   0.69     3
     4 01.2021-03.2021     4 20yo_West_old_3child   0.14     1
     5 01.2021-03.2021     5 20yo_West_old_3child   0.98     1
     6 01.2021-03.2021     6 20yo_West_old_3child   0.89     1
     7 01.2021-03.2021     7 20yo_West_old_3child   0.45     1
     8 01.2021-03.2021     8 20yo_West_old_3child   0.1      1
     9 01.2021-03.2021     9 20yo_West_old_3child   0.4      1
    10 01.2021-03.2021    10 20yo_West_old_3child   0.3      1