Using sample() to create a new variable based on levels of other variables


Consider this data frame (the one I'm actually working with is much bigger):

library(dplyr)  # provides tibble() and %>%

set.seed(13)
test <- tibble(A = as.factor(1:10),
               B = as.factor(sample(c("Apple", "Banana"), 10, replace = TRUE)),
               C = as.factor(sample(c("Cut", "Mashed"), 10, replace = TRUE)),
               D = as.factor(sample(1:3, 10, replace = TRUE)))

I need to create another numeric variable whose value is identical for all rows that share the same levels of the other variables. Let me illustrate.

When I try this, or any of the other approaches I could find:

test %>%
  group_by(B,C,D) %>%
  mutate(E = sample(seq(0.01:100, 0.01), 10, replace = T))

I get an error message.

The result I'm after is shown below. The values need to come from sample() or another random-number function.

   A     B      C      D         E
   <fct> <fct>  <fct>  <fct> <dbl>
 1 1     Banana Mashed 3       0.2
 2 2     Apple  Cut    1       4
 3 3     Banana Mashed 1       5
 4 4     Apple  Mashed 2       3
 5 5     Banana Cut    1       1.3
 6 6     Apple  Cut    3       4.7
 7 7     Banana Mashed 1       5
 8 8     Banana Mashed 1       5
 9 9     Banana Cut    3       3.2
10 10    Banana Cut    3       3.2

So rows 9 and 10 must share one value, and rows 3, 7 and 8 must share another, because within each set the levels of B, C and D are the same.

Any idea how to do this?


Solution

  • If I understand correctly, you want something like this: create the new column on the distinct combinations of your grouping factors, then join it back in so that every row with the same combination gets the same value.

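    library(dplyr)
    
    # Draw one random value per distinct (B, C, D) combination
    new_values <- test %>% 
      distinct(B, C, D) %>% 
      mutate(E = sample(seq(0.01, 100, 0.01), n(), replace = TRUE)) 
    
    # Join the values back so rows with the same combination share the same E
    test %>%
      left_join(new_values, by = c("B", "C", "D"))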
    # # A tibble: 10 x 5
    #    A     B      C      D         E
    #    <fct> <fct>  <fct>  <fct> <dbl>
    #  1 1     Banana Mashed 3       68.0 
    #  2 2     Apple  Cut    1       16.4 
    #  3 3     Banana Mashed 1       80.2 
    #  4 4     Apple  Mashed 2       74.4 
    #  5 5     Banana Cut    1       1.53
    #  6 6     Apple  Cut    3       27.8 
    #  7 7     Banana Mashed 1       80.2 
    #  8 8     Banana Mashed 1       80.2 
    #  9 9     Banana Cut    3       83.4 
    # 10 10    Banana Cut    3       83.4 
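
    To confirm that the join gives each combination a single value, one option is to count the distinct values of E per group. This is only a sketch: it rebuilds the example data from the question, and the names `joined` and `check` are made up here for illustration.

    ```r
    library(dplyr)

    # Rebuild the example data from the question (same seed, same calls).
    set.seed(13)
    test <- tibble(A = as.factor(1:10),
                   B = as.factor(sample(c("Apple", "Banana"), 10, replace = TRUE)),
                   C = as.factor(sample(c("Cut", "Mashed"), 10, replace = TRUE)),
                   D = as.factor(sample(1:3, 10, replace = TRUE)))

    # One random draw per distinct (B, C, D) combination.
    new_values <- test %>%
      distinct(B, C, D) %>%
      mutate(E = sample(seq(0.01, 100, 0.01), n(), replace = TRUE))

    # Join the draws back onto the original rows.
    joined <- test %>%
      left_join(new_values, by = c("B", "C", "D"))

    # Hypothetical check table: rows and distinct E values per combination.
    check <- joined %>%
      group_by(B, C, D) %>%
      summarise(n_rows = n(), n_values = n_distinct(E), .groups = "drop")

    # Every combination should carry exactly one value of E.
    stopifnot(all(check$n_values == 1))
    ```

    If the stopifnot() call passes silently, every set of rows with matching B, C and D has the same E.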
    

    You can also do this with group_modify(), but it sorts the rows by group and moves the grouping columns to the front. The code below iterates over each group, adds a column E based on a sample of size 1, and then stacks the resulting groups back into one data frame.

    test %>% 
      group_by(B, C, D) %>% 
      # .x holds the rows of one group; the single sampled value is recycled across them
      group_modify(~ mutate(.x, E = sample(seq(0.01, 100, 0.01), 1, replace = TRUE)))
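
    A third option, if you want to keep the original row and column order, may be a plain grouped mutate() with a sample of size 1. As far as I can tell, the attempt in the question fails for two reasons: seq(0.01:100, 0.01) passes a whole vector as the `from` argument, and a sample of size 10 does not match the group sizes. The sketch below rebuilds the example data so it runs on its own; the name `with_e` is made up for illustration.

    ```r
    library(dplyr)

    # Rebuild the example data from the question (same seed, same calls).
    set.seed(13)
    test <- tibble(A = as.factor(1:10),
                   B = as.factor(sample(c("Apple", "Banana"), 10, replace = TRUE)),
                   C = as.factor(sample(c("Cut", "Mashed"), 10, replace = TRUE)),
                   D = as.factor(sample(1:3, 10, replace = TRUE)))

    with_e <- test %>%
      group_by(B, C, D) %>%
      # sample(..., 1) is evaluated once per group and returns a single value,
      # which mutate() recycles to every row of that group.
      mutate(E = sample(seq(0.01, 100, 0.01), 1)) %>%
      ungroup()

    with_e
    ```

    The ungroup() at the end is optional; it just returns an ordinary tibble so later steps are not accidentally computed per group.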