Extract first two digits for each number in the list in R data table


I have a column in a data table that holds purchase codes. Each contract is one row. If a contract has a single purchase, the code is a single character string (for instance, 11.25.64). If a contract has several purchases, the codes are stored in a list. It looks something like this:

dt n  codes
   1  11.25.64
   2  c('11.25.16', '25.84.78', '78.26.99')
   3  81.62.16
   4  c('16.25.16', '99.84.78', '28.26.99') 

For classification purposes, I want to extract only the first two digits of each code. So I want to create a new column that looks something like this:

 dt n  classification_codes
    1  11
    2  c('11', '25', '78')
    3  81
    4  c('16', '99', '28') 

I tried executing the following code

dt$classification_codes<- substr(dt$codes, start = 1, stop = 2)

However, this only works for the rows that hold a single code. For the rows that hold a list, it returns 'c(':

dt n  classification_codes
    1  11
    2  c(
    3  81
    4  c(
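
A likely explanation, assuming `codes` is a list column: `substr()` coerces non-character input with `as.character()`, and for a list that turns every multi-element entry into the text of a `c(...)` call, so its first two characters are `c(`. A minimal base R sketch with made-up data (the `codes` list below is hypothetical, not the real table):

```r
# Hypothetical data: a list holding one or several codes per row
codes <- list("11.25.64", c("11.25.16", "25.84.78", "78.26.99"))

# as.character() deparses multi-element entries into the text of a c(...) call
as_text <- as.character(codes)

# substr() therefore works on that text and returns its first two characters
first_two <- substr(as_text, 1, 2)
print(first_two)
```

Under that assumption, the single-code row comes back as `11` and the multi-code row as `c(`, which matches the output above.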

Then I tried to use a different approach and do something like this

dt$classification_codes <- lapply(str_split(dt$codes, " "), substr, 1, 3)

But I get the following output. This seems closer to what I want, but it is still not right: it is as if the first element of the list isn't read properly when I execute the code.

 dt n  classification_codes
    1  11
    2  c("c(", "\"25","\"78")
    3  81
    4  c("c(", "\"99", "\"28")

Solution

  • Here is an approach you could try with library stringr:

    library(stringr)

    a <- c('11.25.16', '25.84.78', '78.26.99')
    
    str_split(a, "\\.")
    

    This gives you a list

    > str_split(a, "\\.")
    [[1]]
    [1] "11" "25" "16"
    
    [[2]]
    [1] "25" "84" "78"
    
    [[3]]
    [1] "78" "26" "99"
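
    To get from that list to the classification codes, one option is to keep only the first piece of each split code. The sketch below swaps in base R `strsplit()` for `str_split()` so that it runs without extra packages; the names `pieces` and `first_pieces` are only illustrative:

    ```r
    a <- c("11.25.16", "25.84.78", "78.26.99")

    # Split each code on the literal dot, giving one character vector per code
    pieces <- strsplit(a, ".", fixed = TRUE)

    # Keep only the first piece of every code
    first_pieces <- vapply(pieces, function(p) p[1], character(1))
    print(first_pieces)
    ```

    For this input the result is "11" "25" "78".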
    

    I tried to solve your problem based on the new information given, so I wrote an ugly function for you:

    extractor <- function(string) {
      # A deparsed vector such as c('11.25.16', '25.84.78') starts with "c(" and a quote
      if (grepl("^(c[[:punct:]]{2}\\d\\d\\.\\d\\d\\.\\d\\d)", string)) {
        # Pull out every code, then keep the first two digits of each one
        tmp <- string %>%
          str_extract_all("\\d\\d\\.\\d\\d\\.\\d\\d") %>%
          unlist() %>%
          substr(1, 2)
        # Rebuild the c('..', '..') text from the two-digit prefixes
        tmp <- paste0("c('", paste(tmp, collapse = "', '"), "')")
      } else {
        # A single code: keep its first two digits
        tmp <- string %>%
          str_extract("^(\\d\\d)")
      }
      return(tmp)
    }
    

    I suppose you would then apply it to every row like this:

    dt$classification_codes <- dt$codes %>% lapply(extractor) %>% unlist()
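
If `codes` is a genuine list column rather than text, a simpler route may be to apply `substr()` to each row's vector directly, which keeps one vector of two-digit codes per contract. This is only a sketch under that assumption: it uses a base R `data.frame` as a stand-in for the data table, and the values are copied from the example in the question.

```r
# Hypothetical data mirroring the question: a list column of codes
dt <- data.frame(n = 1:4)
dt$codes <- list(
  "11.25.64",
  c("11.25.16", "25.84.78", "78.26.99"),
  "81.62.16",
  c("16.25.16", "99.84.78", "28.26.99")
)

# substr() is vectorised, so applying it per row keeps one vector per contract
dt$classification_codes <- lapply(dt$codes, substr, 1, 2)
print(dt$classification_codes[[2]])
```

The new column is again a list, so row 2 holds "11" "25" "78" and row 1 holds just "11".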