
How to get the number/alphabet index pattern in an alphanumeric string in R?


Say I have a string like this: xyz45kpt793rsdwq1

I need to compute its equivalent alphabet & number sequence pattern like this as output: 3a2n3a3n5a1n

Where,
"a" represents a letter
"n" represents a digit
and each number gives the length of a continuous run of letters or digits

Here is what I tried:

strsplit("xyz45kpt793rsdwq1", "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)

I get the output as:

[[1]]
[1] "xyz"   "45"    "kpt"   "793"   "rsdwq" "1" 

Then I checked whether each of these values is a run of letters or a run of digits, as follows (this returns TRUE for letters and FALSE for digits):

x <- strsplit("xyz45kpt793rsdwq1", "(?=[A-Za-z])(?<=[0-9])|(?=[0-9])(?<=[A-Za-z])", perl=TRUE)[[1]][2]
grepl("^[A-Za-z]+$", x, perl = TRUE)

I did this for each of the 6 elements; the code above shows the 2nd element, [[1]][2], as an example.

Next, I found the length of each element with nchar(x). Combining the two gives 3a for the 1st element, 2n for the 2nd, and so on. Concatenating all of these gives the desired pattern: 3a2n3a3n5a1n
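
For reference, the manual steps described above (split into runs, classify each run, measure it, concatenate) can be folded into one small function. This is only a sketch: run_pattern is a hypothetical name, it assumes the string contains only ASCII letters and digits, and it extracts the runs with regmatches()/gregexpr() rather than the lookaround-based strsplit() call.

```r
# Hypothetical helper (not from the original post): pattern for a single string.
run_pattern <- function(s) {
  # Extract each maximal run of letters or of digits.
  runs <- regmatches(s, gregexpr("[A-Za-z]+|[0-9]+", s))[[1]]
  # Label letter runs "a" and digit runs "n".
  type <- ifelse(grepl("^[A-Za-z]+$", runs), "a", "n")
  # Glue the "<length><label>" pieces into one string.
  paste0(nchar(runs), type, collapse = "")
}

run_pattern("xyz45kpt793rsdwq1")
# [1] "3a2n3a3n5a1n"
```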

But this approach seems overly long. It would also get complicated for an entire column of strings in a data frame, where I need to compute this pattern for every value.

Can anyone help with a line of code that does this much more efficiently?


Solution

  • You can use gsubfn here:

    library(gsubfn)
    x <- "xyz45kpt793rsdwq1"
    gsubfn("(\\d+)|(\\p{L}+)", function(x,y) ifelse(nzchar(x), paste0(nchar(x),"n"), paste0(nchar(y),"a")), x, perl=TRUE)
    # => [1] "3a2n3a3n5a1n"
    

    The PCRE regex (enabled by perl=TRUE), (\d+)|(\p{L}+), either captures one or more digits into Group 1 (x) or captures one or more letters into Group 2 (y). If Group 1 matched (nzchar(x) is TRUE), the replacement is the length of the match (nchar(x)) followed by n. Otherwise Group 2 matched, and the replacement is the length of that group (nchar(y)) followed by a.
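
For the data-frame case raised in the question, here is a base R sketch that avoids the extra package by run-length encoding the character classes with rle(). The helper name pattern_rle, the data frame df, and its column code are all made up for illustration, and the code assumes every character is either a letter or a digit (anything that is not a digit is counted as "a").

```r
# Hypothetical helper: classify each character, then run-length encode the classes.
pattern_rle <- function(s) {
  chars <- strsplit(s, "")[[1]]
  # "n" for digits, "a" for everything else (assumes only letters and digits occur).
  cls <- ifelse(grepl("^[0-9]$", chars), "n", "a")
  r <- rle(cls)
  paste0(r$lengths, r$values, collapse = "")
}

# Hypothetical data frame and column name, for illustration only.
df <- data.frame(code = c("xyz45kpt793rsdwq1", "ab12", "7x"),
                 stringsAsFactors = FALSE)

# Apply the helper to every row; vapply guarantees one string per input.
df$pattern <- vapply(df$code, pattern_rle, character(1), USE.NAMES = FALSE)
# df$pattern now holds "3a2n3a3n5a1n", "2a2n" and "1n1a"
```

Using vapply() rather than sapply() keeps the result a plain character vector even if the column is empty.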