I have the following data frame:
ID TX | GROUP |
---|---|
HUDJDUDOOD--BANNK2--OLDODOLD985555545UIJF | 1 |
UJDID YUH23498 IDX09 | 2 |
854 UIJSAZXC | 3 |
I would like to extract the longest string from each value in the column ID TX.
Each cell may contain several strings or just one, and the strings may be separated by punctuation such as "," or "--",
or even by a space " ".
My idea is the following: first replace the punctuation with a white space " ",
then split each cell on " ".
After that I would calculate the length of each string, perhaps with nchar()
or str_length(),
and select the string with the largest length. I have not been able to do this yet, because after splitting I can't select the element (word) I need: I don't know at which index the longest string will be. My desired output would be:
OUTPUT |
---|
OLDODOLD985555545UIJF |
YUH23498 |
UIJSAZXC |
Side note: no worries there will not be ties.
# Your data
dat <- structure(list(ID_TX = c("HUDJDUDOOD--BANNK2--OLDODOLD985555545UIJF",
"UJDID YUH23498 IDX09", "854 UIJSAZXC"), GROUP = 1:3), class = "data.frame", row.names = c(NA,
-3L))
# Split each string on "--" or whitespace (extend the pattern, e.g. with ",", if other separators occur)
spl <- strsplit(dat$ID_TX, "--|\\s")
# Identify the position of the longest string in each row
idx <- spl |> lapply(nchar) |> lapply(which.max) |> unlist()
# Select the longest string and bind them to a data.frame
mapply(function(x, y) spl[[x]][y], seq_along(idx), idx) |>
  as.data.frame() |>
  setNames("OUTPUT")
# The result
# OUTPUT
#1 OLDODOLD985555545UIJF
#2 YUH23498
#3 UIJSAZXC
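
The same idea can probably be written more compactly by picking the longest piece directly instead of computing the index first. The sketch below is only one possible variant: `longest_token` is a hypothetical helper name, and the separator pattern `[[:punct:][:space:]]+` is an assumption that treats any run of punctuation or whitespace as a delimiter (so "," is covered as well as "--"). It would need adjusting if the IDs themselves can contain punctuation.

```r
# Hypothetical helper (not part of the answer above): return the longest
# piece of each string after splitting on punctuation or whitespace.
longest_token <- function(x, sep = "[[:punct:][:space:]]+") {
  vapply(strsplit(x, sep), function(parts) {
    # Drop empty or missing pieces, e.g. from a leading separator or an NA input
    parts <- parts[!is.na(parts) & nzchar(parts)]
    if (length(parts) == 0) return(NA_character_)
    # which.max() returns the first maximum, so ties resolve to the earliest piece
    parts[which.max(nchar(parts))]
  }, character(1))
}

dat <- data.frame(
  ID_TX = c("HUDJDUDOOD--BANNK2--OLDODOLD985555545UIJF",
            "UJDID YUH23498 IDX09", "854 UIJSAZXC"),
  GROUP = 1:3
)

# Add the result as a new column next to the original data
dat$OUTPUT <- longest_token(dat$ID_TX)
dat$OUTPUT
```

Using `vapply()` with `character(1)` keeps the result a plain character vector of the same length as the input, so it can be assigned straight back to the data frame.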