I think I have a simple question, but I could not work it out. I have something like this:
df <- data.frame(identifier = c("9562231945200505501901190109-5405303
", "190109-8731478", "1901098260031", "
.9..43675190109-3690341", "-1103214010200000190109-8841419", "-190109-5232506-.08001234-111",
"190109-2018362-","51770217835901218103304190109-9339765
"), true_values = c("190109-5405303","190109-8731478","190109-8260031","190109-3690341","190109-8841419",
"190109-5232506","190109-2018362","190109-9339765"))
I used the following function and it almost worked, but I do not know how to avoid the last dash.
I tried str_replace and something else, but it did not work.
You can try substr with paste after removing unwanted parts with gsub.
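tt <- gsub("-\\..*", "", df$identifier)         # drop everything from the first "-." onwards
tt <- gsub("[^0-9]", "", tt)                    # keep digits only
tt <- substring(tt, nchar(tt)-12)               # keep the last 13 digits
paste0(substr(tt, 1, 6), "-", substring(tt, 7)) # put the dash back after the 6th digit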
#[1] "190109-5405303" "190109-8731478" "190109-8260031" "190109-3690341"
#[5] "190109-8841419" "190109-5232506" "190109-2018362" "190109-9339765"
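
If you need this in more than one place, the same steps can be wrapped in a small function. This is only a sketch: the name extract_id is made up for illustration, and it assumes every identifier ends with the 13 digits you want once the "-." tail and all non-digits are removed.

```r
# Hypothetical helper; extract_id is an assumed name, not from any package.
extract_id <- function(x) {
  x <- gsub("-\\..*", "", x)        # trailing junk that starts with "-."
  x <- gsub("[^0-9]", "", x)        # dashes, dots, whitespace
  x <- substring(x, nchar(x) - 12)  # last 13 digits
  paste0(substr(x, 1, 6), "-", substring(x, 7))
}

# Three of the cases from the question: trailing dash, missing dash, "-." tail
ids <- c("190109-2018362-", "1901098260031", "-190109-5232506-.08001234-111")
extract_id(ids)
#[1] "190109-2018362" "190109-8260031" "190109-5232506"
```

With your data you could then check the result with all(extract_id(df$identifier) == df$true_values).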