Search code examples
rdplyrtidyversegrouping

tidyverse: Creation of a custom grouping variable in R


I wanted to create a group variable with specific conditions based on the values of GT. If GT is 'I', it should generate a different group number. If GT is 'G', the following rules should apply:

  1. For the first four consecutive 'G's, assign the same group number, and for the next four consecutive 'G's, assign the next group number.
  2. If there are five consecutive 'G's, assign the same group number to the first three 'G's and a different group number to the remaining two 'G's.

My Attempt:

library(tidyverse)
df1 <- 
  data.frame(
   GT = c(rep("G", 9), rep("I", 2), rep("G", 2), rep("I", 1))
  )

df2 <- 
  df1 %>%
  mutate(
      grp1 = case_when(
         GT == "I" ~ row_number()
      , .default = consecutive_id(GT)
      )
         ) %>%
  group_by(GT, grp1) %>%
  mutate(count = n()) %>%
  ungroup() %>% 
  mutate(
    grp2 = case_when(
        GT == "G" & count %/% 4 > 0  ~ row_number() %/% 5 + row_number(1)
      , .default = grp1
    )
  , grp3 = case_when(
        GT == "G"~ (cumsum(GT == "G") - 1) %/% 4 + 1
      , .default = grp1 
        )
      )

df2
#> # A tibble: 14 × 5
#>    GT     grp1 count  grp2  grp3
#>    <chr> <int> <int> <dbl> <dbl>
#>  1 G         1     9     1     1
#>  2 G         1     9     1     1
#>  3 G         1     9     1     1
#>  4 G         1     9     1     1
#>  5 G         1     9     2     2
#>  6 G         1     9     2     2
#>  7 G         1     9     2     2
#>  8 G         1     9     2     2
#>  9 G         1     9     2     3
#> 10 I        10     1    10    10
#> 11 I        11     1    11    11
#> 12 G         3     2     3     3
#> 13 G         3     2     3     3
#> 14 I        14     1    14    14

The group variable grp2 is very close to requirements. However, I want df2[8:9, 4] should have different group number than df2[5:7, 4]. Any hint please!


Solution

  • Tidyverse approach:

    This is similar to your code above, with just a couple of changes.

    The grp1 column is a grouping column where:

    • if GT == "I" then each row is a group
    • if GT == "G" then each set of consecutive Gs is a group

    The grp2 column is to get the correct grouping within the sets of consecutive Gs. In general, (rownum() - 1) %/% 4 gives the correct groups except where the last group has only one row. In that case the second from last group value (where row_number() == n() - 1) needs to have + 1 added.

    (In your code row_number() %/% 5 gives correct values for series of up to length 8, (row_number() - 1) %/% 4 produces the correct groupings for larger series).

    Finally I use the unique combinations of grp1 and grp2 to create a new id variable to ensure uniqueness.

    library(tidyverse)
    
    df1 <- data.frame(GT = c(rep("G", 9), rep("I", 2), rep("G", 2), rep("I", 1), rep("G", 5), "I", rep("G", 10)) )
    
    want <- df1 %>%
      mutate(
        grp1 = consecutive_id(GT, if_else(GT == "I", row_number(), -1))
      ) %>%
      mutate(
        grp2 = case_when(
          n() %% 4 != 1 & GT == "G" ~ (row_number() - 1) %/% 4,
          GT == "G" ~ if_else(row_number() == n() - 1, (row_number() - 1) %/% 4 + 1, (row_number() - 1) %/% 4),
          .default = grp1  # This is where GT == "I"
        ),
        .by = grp1
      ) %>%
      mutate(group = consecutive_id(grp1, grp2)) %>%
      select(-grp1, -grp2)
    
    want
    #>    GT group
    #> 1   G     1
    #> 2   G     1
    #> 3   G     1
    #> 4   G     1
    #> 5   G     2
    #> 6   G     2
    #> 7   G     2
    #> 8   G     3
    #> 9   G     3
    #> 10  I     4
    #> 11  I     5
    #> 12  G     6
    #> 13  G     6
    #> 14  I     7
    #> 15  G     8
    #> 16  G     8
    #> 17  G     8
    #> 18  G     9
    #> 19  G     9
    #> 20  I    10
    #> 21  G    11
    #> 22  G    11
    #> 23  G    11
    #> 24  G    11
    #> 25  G    12
    #> 26  G    12
    #> 27  G    12
    #> 28  G    12
    #> 29  G    13
    #> 30  G    13
    

    Created on 2024-05-25 with reprex v2.1.0

    Original approach corrected to work for 5 consecutive 'G's

    I followed your approach of creating the grp1 column, but I used rep() along with some additional modulo arithmetic to calculate the grp2 column which allows us to make the final "3+2" group.

    Because of this I had to switch from case_when() to if...else and add some calls to unique() to avoid recycling the values in the if condition.

    Edit: Updated as original solution failed for any group of 5 consecutuve 'G's

    library(tidyverse)
    
    df1 <-   data.frame(GT = c(rep("G", 9), rep("I", 2), rep("G", 2), rep("I", 1), rep("G", 5), "I", rep("G", 10)) )
    
    want <- df1 %>%
      mutate(
        grp1 = case_when(
          GT == "I" ~ row_number(),
          .default = consecutive_id(GT)
        )
      ) %>%
      mutate(
        grp2 = if(unique(GT) == "I" | unique(n() %/% 4) < 1) {
          grp1
        } else if(unique(n() %% 4) == 1) {
          c(rep(1:(n() %/% 4 - 1), each = 4 * (unique(n())>5)), rep(n() %/% 4, 3), rep(n() %/% 4 + 1, 2))
        } else {
          c(rep(1:(n() %/% 4), each = 4), rep(n() %/% 4 + 1, each = n() %% 4))
        } ,
        .by = c(GT, grp1)
      ) %>%
      mutate(group = consecutive_id(GT, grp2)) %>%
      select(-grp1, -grp2)
    
    want
    #>    GT group
    #> 1   G     1
    #> 2   G     1
    #> 3   G     1
    #> 4   G     1
    #> 5   G     2
    #> 6   G     2
    #> 7   G     2
    #> 8   G     3
    #> 9   G     3
    #> 10  I     4
    #> 11  I     5
    #> 12  G     6
    #> 13  G     6
    #> 14  I     7
    #> 15  G     8
    #> 16  G     8
    #> 17  G     8
    #> 18  G     9
    #> 19  G     9
    #> 20  I    10
    #> 21  G    11
    #> 22  G    11
    #> 23  G    11
    #> 24  G    11
    #> 25  G    12
    #> 26  G    12
    #> 27  G    12
    #> 28  G    12
    #> 29  G    13
    #> 30  G    13
    

    Created on 2024-05-25 with reprex v2.1.0

    Further expalanation of this approach: This uses a slightly different way of calculating the grp2 column. Whereas the tidyverse solution looks at row_number() to decide what the value should be at each row, this approach looks at the total number of Gs (i.e. n()) and calculates a new vector of the same length as n().

    Think of it as creating incrementing sets of 4 numbers (1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, ...) until it gets within 4 places of n(). Then it adds a shorter vector (of length n() %% 4) to fill in the remaining values. This is what c(rep(1:(n() %/% 4), each = 4), rep(n() %/% 4 + 1, each = n() %% 4)) is calculating.

    For the exception, where n() %% 4 == 1 we would finish with a group of length 1. Instead we stop our groups of 4 repeated numbers one step earlier and add 2 new sets with length=3 and length=2 respectivley. This is done in c(rep(1:(n() %/% 4 - 1), each = 4 * (unique(n())>5)), rep(n() %/% 4, 3), rep(n() %/% 4 + 1, 2)).

    (The 4 * (unique(n())>5)) resolves to 0 in the case of a run of exactly 5 Gs so it doesn't include any sets of 4 identical numbers, just the 3 + 2 that we want. Neglecting this case was why my original solution failed).