Search code examples
rdataframedplyrcase

Categorization by median within groups


Lets assume a data like this below

ID  Quantity    Group     Indicator
1   0.93        Red       1
2   0.17        Red       1
3   0.01        Red       0
4   0.44        Red       1
5   0.01        Red       0
6   0.86        Red       1
7   0.07        Red       1
8   0.02        Red       0   
9   1.00        Red       1
1   0.65        Blue      1
2   0.17        Blue      1
3   0.02        Blue      0
4   0.01        Blue      0
5   0.09        Blue      1
6   0.86        Blue      1
7   0.05        Blue      0
8   0.23        Blue      1
9   0.01        Blue      0

Now I want to create a new column with three values/categories.

  • 1 if the indicator column value is 0
  • 3 if the indicator column value is 1 and values in column Quantity is at or above the median (median of Quantity for all values where indicator ==1).
  • 2 if the indicator column value is 1 and values in column Quantity is at or below the median (median of Quantity for all values where indicator ==1).

This logic to be applied for values in Group separately.

Expecting an output like this.

median of all values in Quantity where Indicator == 1 for Group Red. median(0.93, 0.17,0.44,0.86,0.07,1.00) = 0.65

median of all values in Quantity where Indicator == 1 for Group Blue. median(0.65, 0.17,0.09,0.86,0.23) = 0.23

ID  Quantity    Group     Indicator   Results
1   0.93        Red       1           3   <- Above the median(0.65) for all values in red where Indicator ==1
2   0.17        Red       1           2   <- Below the median(0.65) for all values in red where Indicator ==1
3   0.01        Red       0           1
4   0.44        Red       1           2   <- Below the median(0.65) for all values in red where Indicator ==1
5   0.01        Red       0           1
6   0.86        Red       1           3   <- Above the median(0.65) for all values in red where Indicator ==1
7   0.07        Red       1           2   <- Below the median(0.65) for all values in red where Indicator ==1
8   0.02        Red       0           1
9   1.00        Red       1           3   <- Above the median(0.65) for all values in red where Indicator ==1

1   0.65        Blue      1           3
2   0.17        Blue      1           2
3   0.02        Blue      0           1
4   0.01        Blue      0           1
5   0.09        Blue      1           2
6   0.86        Blue      1           3
7   0.05        Blue      0           1
8   0.23        Blue      1           2
9   0.01        Blue      0           1 

I have tried this with bunch of ifelse and it is very clumsy. Looking for something efficient with case_when. Thanks in advance.


Solution

  • A dplyr approach to achieve your desired result may look like so:

    library(dplyr, warn = FALSE)
    
    dat |> 
      mutate(
        Results = case_when(
          Indicator == 0 ~ 1,
         Quantity <= median(Quantity) ~ 2,
         .default = 3
        ),
        .by = c(Group, Indicator)
      )
    #>    ID Quantity Group Indicator Results
    #> 1   1     0.93   Red         1       3
    #> 2   2     0.17   Red         1       2
    #> 3   3     0.01   Red         0       1
    #> 4   4     0.44   Red         1       2
    #> 5   5     0.01   Red         0       1
    #> 6   6     0.86   Red         1       3
    #> 7   7     0.07   Red         1       2
    #> 8   8     0.02   Red         0       1
    #> 9   9     1.00   Red         1       3
    #> 10  1     0.65  Blue         1       3
    #> 11  2     0.17  Blue         1       2
    #> 12  3     0.02  Blue         0       1
    #> 13  4     0.01  Blue         0       1
    #> 14  5     0.09  Blue         1       2
    #> 15  6     0.86  Blue         1       3
    #> 16  7     0.05  Blue         0       1
    #> 17  8     0.23  Blue         1       2
    #> 18  9     0.01  Blue         0       1
    

    DATA

    dat <- data.frame(
      ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9),
      Quantity = c(0.93, 0.17, 0.01, 0.44, 0.01, 0.86, 0.07, 0.02, 1.00, 0.65, 0.17, 0.02, 0.01, 0.09, 0.86, 0.05, 0.23, 0.01),
      Group = c("Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Red", "Blue", "Blue", "Blue", "Blue", "Blue", "Blue", "Blue", "Blue", "Blue"),
      Indicator = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0)
    )