Search code examples
rrangecategories

Create valid range from a factor and apply another factor range in R


I have a csv file of 2 columns, first column are factor categories representing vessel size ranges, and the second column is a vessel class that falls into that size category. I need to use these data to then fill in a new table of different established vessel size ranges. Fore example: my initial raw data is in two columns;

dat.start<-data.frame(category=c(rep("1-10",3), rep("11-20",3), rep("21-30",3), rep("32-40",3), rep("41-50",3), rep("51-59",3)), class=rep(c("a","b","c"),6))

When I aggregate by class by category eg

ag.dat<-aggregate(class ~ category, data = dat.start, length)

you'll see I get a df with a structure str(ag.dat) consisting of chr column and int column.

The next problem is that I need to assign those frequencies of vessel sizes into a table of new, predetermined vessel size categories that are different from the first. For example, below are the new size categories and the frequency of vessel classes based on the original dat.start data

dat.end<-data.frame(category=c("1-20", "21-50", ">50"), class=c(6, 9, 3))

So my question is how to go from dat.start to dat.end? My first thoughts were to somehow split up the chr string categories and create new dat.start and dat.end ranges that can be numerically interpreted, such as what is produced by cut. But then I drew a blank when it came to going the next step of actually creating the frequencies of vessel class based on new categories. Plus the converting of chr string ranges to numerical ranges also stumped me.

The closest solution I found on the web was here I think; Identify the matching range from a list of valid range

but this looks like its written for Python/Pandas and I need to do it in R. Thanks.


Solution

  • If you separate "category" into two columns you can make numerical comparisons, e.g.

    library(tidyverse)
    
    dat.start<-data.frame(category=c(rep("1-10",3), rep("11-20",3), rep("21-30",3), rep("32-40",3), rep("41-50",3), rep("51-59",3)), class=rep(c("a","b","c"),6))
    
    dat.start
    #>    category class
    #> 1      1-10     a
    #> 2      1-10     b
    #> 3      1-10     c
    #> 4     11-20     a
    #> 5     11-20     b
    #> 6     11-20     c
    #> 7     21-30     a
    #> 8     21-30     b
    #> 9     21-30     c
    #> 10    32-40     a
    #> 11    32-40     b
    #> 12    32-40     c
    #> 13    41-50     a
    #> 14    41-50     b
    #> 15    41-50     c
    #> 16    51-59     a
    #> 17    51-59     b
    #> 18    51-59     c
    
    dat.end<-data.frame(category=c("1-20", "21-50", ">50"), class=c(6, 9, 3))
    
    dat.end
    #>   category class
    #> 1     1-20     6
    #> 2    21-50     9
    #> 3      >50     3
    
    dat.start %>%
      separate(category, into = c("min", "max"), sep = "-") %>%
      mutate(category = case_when(max <= 20 ~ "1-20",
                                  min > 20 & max <= 50 ~ "21-50",
                                  min > 50 ~ ">50")) %>%
      summarise(class = n(), .by = category)
    #>   category class
    #> 1     1-20     6
    #> 2    21-50     9
    #> 3      >50     3
    

    Or another potential approach is to use a 'look up' table, e.g.

    lookup_table <- setNames(c("1-20", "1-20", "21-50",
                               "21-50", "21-50", ">50"),
                             unique(dat.start$category))
    lookup_table
    #>    1-10   11-20   21-30   32-40   41-50   51-59 
    #>  "1-20"  "1-20" "21-50" "21-50" "21-50"   ">50"
    
    dat.start %>%
      mutate(category = recode(category, !!!lookup_table)) %>%
      summarise(class = n(), .by = category)
    #>   category class
    #> 1     1-20     6
    #> 2    21-50     9
    #> 3      >50     3
    

    Created on 2023-03-07 with reprex v2.0.2

    There are many different ways to use a lookup table for this type of task, see Canonical tidyverse method to update some values of a vector from a look-up table for more methods / examples