I have a column in a data table that holds purchase codes. Each contract is one row. If a contract has a single purchase, the code is a single character value (for instance, 11.25.64). If a contract has several purchases, the codes are stored in a list. It looks something like this:
dt n codes
1 11.25.64
2 c('11.25.16', '25.84.78', '78.26.99')
3 81.62.16
4 c('16.25.16', '99.84.78', '28.26.99')
For classification purposes, I want to extract only the first two digits of each code. So I want to create a new column that looks something like this:
dt n classification_codes
1 11
2 c('11', '25', '78')
3 81
4 c('16', '99', '28')
I tried executing the following code:
dt$classification_codes<- substr(dt$codes, start = 1, stop = 2)
However, this only works for the rows with a single code. For the rows that hold a list of codes, it returns 'c(':
dt n classification_codes
1 11
2 c(
3 81
4 c(
Then I tried a different approach:
dt$classification_codes <- lapply(str_split(dt$codes, " "), substr, 1, 3)
But I get the following output. It is closer to what I want, but still not right: the first element of each list does not seem to be read correctly.
dt n classification_codes
1 11
2 c("c(", "\"25","\"78")
3 81
4 c("c(", "\"99", "\"28")
Here is an approach you could try with the stringr library:
library(stringr)
a <- c('11.25.16', '25.84.78', '78.26.99')
# The dot is a regex metacharacter, so it has to be escaped
str_split(a, "\\.")
This gives you a list with one character vector per code:
> str_split(a, "\\.")
[[1]]
[1] "11" "25" "16"
[[2]]
[1] "25" "84" "78"
[[3]]
[1] "78" "26" "99"
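To get from that list to the two-digit prefixes, one option is to keep only the first piece of each split. This is a minimal sketch on the same example vector `a`; the names `parts`, `first_two` and `first_two_sub` are just illustrative:

```r
library(stringr)

a <- c('11.25.16', '25.84.78', '78.26.99')

# Split each code on the literal dot, then keep only the first piece
parts <- str_split(a, "\\.")
first_two <- vapply(parts, function(p) p[1], character(1))

# Shortcut that skips the split: take characters 1 to 2 of each code
first_two_sub <- str_sub(a, 1, 2)
```

Both give the same character vector here; the `str_sub()` version only works because every code starts with exactly two digits.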
I tried to solve your problem based on the new information given, so I wrote an ugly function for you:
library(stringr)  # also provides the %>% pipe

# Expects one string per row, e.g. "11.25.64" or "c('11.25.16', '25.84.78')"
extractor <- function(string) {
  if (grepl("^c[[:punct:]]{2}\\d\\d\\.\\d\\d\\.\\d\\d", string)) {
    # Deparsed vector: pull out every code, then keep its first two digits
    prefixes <- string %>%
      str_extract_all("\\d\\d\\.\\d\\d\\.\\d\\d") %>%
      unlist() %>%
      str_sub(1, 2)
    tmp <- paste0("c('", paste(prefixes, collapse = "', '"), "')")
  } else {
    # Single code: keep the leading two digits
    tmp <- str_extract(string, "^\\d\\d")
  }
  tmp
}
I suppose you would then apply it to the column like this:
df$new_line <- df$codes %>% lapply(extractor) %>% unlist
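If `codes` is a genuine list column (one character vector per row) rather than text that merely looks like `c(...)`, no regex should be needed. The sketch below rebuilds the data from the question as a plain data.frame, which is an assumption on my part, and applies `substr()` to each row:

```r
# Hypothetical reconstruction of the question's data:
# `codes` is a list column holding one character vector per contract
dt <- data.frame(n = 1:4)
dt$codes <- list(
  "11.25.64",
  c("11.25.16", "25.84.78", "78.26.99"),
  "81.62.16",
  c("16.25.16", "99.84.78", "28.26.99")
)

# substr() is vectorised, so applying it row by row keeps the list structure
dt$classification_codes <- lapply(dt$codes, substr, start = 1, stop = 2)
```

Calling `substr()` directly on the whole column fails because the list is first coerced to character, which turns each multi-code row into the text of its `c(...)` call.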