Say I have a string like this: xyz45kpt793rsdwq1
I need to compute its equivalent letter-and-number sequence pattern, like this: 3a2n3a3n5a1n
Where,
"a" represents a letter
"n" represents a digit
and the numeric value gives the length of each continuous run of either letters or digits
Here is what I tried:
strsplit("xyz45kpt793rsdwq1", "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)
I get the output as:
[[1]]
[1] "xyz" "45" "kpt" "793" "rsdwq" "1"
Then I checked whether each of the above values is a run of letters or of digits by doing the following (it returns FALSE for a run of digits and TRUE for a run of letters):
x <- strsplit("xyz45kpt793rsdwq1", "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)[[1]][2]
grepl("^[A-Za-z]+$", x, perl = T)
I did this for each of the 6 elements. Here I've shown the code for the 2nd element, addressed as [[1]][2], as an example.
Next, I found the length of each of the above with nchar(x).
Now I can combine these to create the output 3a for the 1st element, 2n for the 2nd element, and so on.
Eventually I can combine all of these to get the desired pattern output, 3a2n3a3n5a1n
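Put together, the manual steps described above look roughly like the sketch below. The helper name seq_pattern_split is hypothetical and only used for illustration; it assumes the input contains nothing but ASCII letters and digits.

```r
# Hypothetical helper that chains the split / classify / count / paste steps.
seq_pattern_split <- function(s) {
  # Split at every letter/digit boundary (same regex as the strsplit() call above)
  parts <- strsplit(s, "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl = TRUE)[[1]]
  # TRUE for a run of letters, FALSE for a run of digits
  is_alpha <- grepl("^[A-Za-z]+$", parts)
  # Prefix each run's length to its type marker and join the pieces
  paste0(nchar(parts), ifelse(is_alpha, "a", "n"), collapse = "")
}

seq_pattern_split("xyz45kpt793rsdwq1")
```

Because nchar(), grepl() and ifelse() are all vectorized over the pieces, no per-element indexing such as [[1]][2] is needed.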
But this approach seems like overkill and is too lengthy. It would also get complicated if I had an entire column of strings in a data frame and needed to compute this pattern for each of them.
Can anyone help with a line of code that does this much more efficiently?
You can use gsubfn here:
library(gsubfn)
x <- "xyz45kpt793rsdwq1"
# Group 1 (digits) -> "<length>n"; Group 2 (letters) -> "<length>a"
gsubfn("(\\d+)|(\\p{L}+)",
       function(x, y) ifelse(nzchar(x), paste0(nchar(x), "n"), paste0(nchar(y), "a")),
       x, perl = TRUE)
# => [1] "3a2n3a3n5a1n"
The PCRE regex (perl=TRUE enables the PCRE engine), (\d+)|(\p{L}+), either matches one or more digits and captures them into Group 1 (x), or matches one or more letters and captures them into Group 2 (y). If Group 1 matched (nzchar(x) is TRUE), the replacement is the length of the match (nchar(x)) followed by n. Otherwise Group 2 matched, and the replacement is the length of that group followed by a.
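If you would rather avoid the extra package, the same run counting can be sketched in base R with rle(). This is an alternative technique, not what gsubfn does internally. The helper name seq_pattern_rle and the sample vector strings are made up for illustration, and the sketch assumes every character is either a digit or a letter (anything that is not a digit is counted as "a").

```r
# Hypothetical base-R helper: run-length encode the character classes.
seq_pattern_rle <- function(s) {
  chars <- strsplit(s, "")[[1]]                     # individual characters
  type  <- ifelse(grepl("[0-9]", chars), "n", "a")  # "n" for digits, "a" otherwise
  runs  <- rle(type)                                # lengths and values of consecutive runs
  paste0(runs$lengths, runs$values, collapse = "")
}

# Applying it to a whole vector (for example, a data frame column)
strings <- c("xyz45kpt793rsdwq1", "ab1", "2024")
vapply(strings, seq_pattern_rle, character(1), USE.NAMES = FALSE)
# => [1] "3a2n3a3n5a1n" "2a1n"         "4n"
```

vapply() is used here so the result is guaranteed to be a character vector with one pattern per input string.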