r, intervals

How to convert numeric values to the intervals they fall into in R?


I am working on a problem where I need to map the numbers in a column to their corresponding intervals, which are stored as character labels. Examples of the intervals and the original values are shown below:

VehicleDriverCarrierPremium_Interval<-c("(Null)",">= 0, <100",">= 100, < 200",">= 200, < 300",">= 300, < 400",">= 400, < 500",">= 500, < 600",">= 600, < 700",">= 700, < 800",">= 800, < 900")
VehicleDriverCarrierPremium<-c(423,12,NA,535,231,875)

What I want at the end would be like this:

VehicleDriverCarrierPremium
[1] ">= 400, < 500" ">= 0, <100" "(Null)" ">= 500, < 600" ">= 200, < 300" ">= 800, < 900"

The problem is that the original values range from 0 to 50000, and the interval levels do not follow a fixed pattern: the intervals get wider as the values get larger. The labels also contain a comma as a thousands separator when a bound is 1,000 or greater. For example, the last two intervals are:

">= 9,000, <10,000", ">= 10,000, <50,000"

What I have done so far is very manual: I divide the intervals into several groups and use if and for statements to convert the original values to their corresponding intervals. But whenever the interval levels or their widths change, I have to update the code by hand.

So I am wondering whether there is a better way that first reads the interval levels, which are of type character, and then replaces each original value with the interval it falls into.

Please let me know if you need any more information. Thank you!


Solution

  • OK, here is a different approach. I am quite sure there are easier and more efficient ways. I am using tidyverse to transform your character intervals into two columns, begin and end.

    library(tidyverse)
    tibble(int_ID = c(">= 0, <100",
                  ">= 100, <200",
                  ">= 200, <1,000",
                  ">= 1,000, <2,000",
                  ">= 2,000, <3,000",
                  ">= 3,000, <5,000",
                  ">= 5,000, <50,000")) %>% 
      separate(int_ID, into=c("begin","end"), ", ",remove = FALSE) %>% 
      mutate(begin = str_sub(begin,4)) %>% 
      mutate(end = str_sub(end,2)) %>% 
      # str_remove_all() so that bounds with more than one comma (e.g. 1,000,000) also parse
      mutate_at(vars(begin,end),~as.integer(str_remove_all(.,","))) -> intervals
    
    # The values must be defined before the loop uses them
    VehicleDriverCarrierPremium <- c(423, 12, NA, 535, 231, 875, 9000)

    # Start from NA so a value that matches no interval stays NA instead of being dropped
    VehicleDriverCarrierPremium_factor <- rep(NA_character_, length(VehicleDriverCarrierPremium))
    for (i in seq_along(VehicleDriverCarrierPremium)) { # for each element
      value <- VehicleDriverCarrierPremium[i]
      if (is.na(value)) {
        VehicleDriverCarrierPremium_factor[i] <- "(Null)"
        next
      }
      for (j in seq_along(intervals$int_ID)) { # test which interval it falls into
        if (value >= intervals$begin[j] && value < intervals$end[j]) {
          VehicleDriverCarrierPremium_factor[i] <- intervals$int_ID[j]
          break # intervals do not overlap, so stop at the first match
        }
      }
    }
    VehicleDriverCarrierPremium_factor
    

    It might take a while if you have tens of thousands of values to categorize and hundreds of intervals. Even with this code, we can do a lot better in terms of performance if you need it.
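
    As one possible way to get that speed-up, here is a minimal base R sketch that replaces the nested loop with a single vectorized lookup via findInterval(). It assumes every label contains exactly two numbers (lower bound first, then upper bound) and that the intervals do not overlap. The names interval_labels, premium and parse_bounds are only illustrative and not part of the code above.

    ```r
    # Interval labels as character strings, plus the values to classify
    interval_labels <- c(">= 0, <100", ">= 100, <200", ">= 200, <1,000",
                         ">= 1,000, <2,000", ">= 2,000, <3,000",
                         ">= 3,000, <5,000", ">= 5,000, <50,000")
    premium <- c(423, 12, NA, 535, 231, 875, 9000)

    # Extract the two numbers in a label, dropping thousands separators
    # (assumption: lower bound comes first, upper bound second)
    parse_bounds <- function(label) {
      nums <- regmatches(label, gregexpr("[0-9][0-9,]*", label))[[1]]
      as.numeric(gsub(",", "", nums))
    }
    bounds <- t(vapply(interval_labels, parse_bounds, numeric(2), USE.NAMES = FALSE))

    # findInterval() needs the lower bounds sorted in increasing order
    ord <- order(bounds[, 1])
    lower <- bounds[ord, 1]
    upper <- bounds[ord, 2]
    sorted_labels <- interval_labels[ord]

    # Index of the last lower bound <= value (0 if below all, NA for NA input)
    idx <- findInterval(premium, lower)
    idx[which(idx == 0)] <- NA

    # Keep a match only if the value is also below that interval's upper bound
    inside <- !is.na(idx) & premium < upper[idx]
    result <- sorted_labels[idx]
    result[!inside] <- NA                # no interval contains the value
    result[is.na(premium)] <- "(Null)"   # missing input gets the "(Null)" level
    result
    ```

    With this layout the labels are parsed once, and each value is placed by a binary search rather than by testing every interval in turn. Values outside every interval come back as NA, so they are easy to spot.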

    Hope this is what you wanted.

    Tom