Search code examples
dplyrcategorization

Categorize data based on multiple criteria using dplyr


I need to categorize a very long df using a long list of criteria. Here is a simplified version of the criteria as a df:

crit <- data.frame(grp = c("g1", "g1", "g1", "g2", "g2", "g2"),
                   class = c("A", "B", "C", "A", "B", "C"),
                   min = c(1, 3, 5, 8, 10, 12),
                   max = c(3, 5, 8, 10, 12, 14)
                   )

A second df would receive a column containing "class" based on whether the value is linked to "grp" (part 1 of the procedure) and falls within the specified ranges (min, max) (part 2 of the procedure). Also, if a value is below the lowest or above the highest value in a range, it will be categorized as belonging to the lowest/highest "class." For example:

grp val class
g1 0 A
g1 1 A
g2 7 A
g2 11 B
df <- data.frame(grp = c("g1", "g1", "g2", "g2"),
                 val = c(0, 1, 7, 11)
                )

Do you have any suggestions on how to do this using dplyr? Any help is very much appreciated.


Solution

  • One option is something like this:

    df %>%
        left_join(crit, by = "grp", relationship = "many-to-many") %>%
        filter(val >= min & val <= max) %>%
        select(-min, -max)
    

    Essentially, it peforms a kind-of crossjoin, then filters to find the ones that match the criteria.

    Another option is this:

    # group everything by `grp`, so we just have one row for each `grp`, and a list of classes, mins and maxes
    crit <- crit %>%
        mutate(class = list(class), min = list(min), max = list(max), .by = "grp") %>%
        distinct()
    
    df %>%
        left_join(crit, by = "grp") %>%
        mutate(class = pmap(list(val, class, min, max), ~..2[..3 <= ..1 & ..1 <= ..4])) %>% # parallel map
        select(-min, -max) %>%
        unnest(class)