Remove alphanumeric words that start with 2 letters followed by exactly 2 digits


a <- c("it is ZZ10ASDJN123 and ZZ100DD22")

How can I remove the words that start with two letters followed by exactly two digits, without removing alphanumeric words in which the two letters are followed by more than two digits?

Expected output:

"it is and ZZ100DD22"

This code removes only the digits. Please help me get the expected output.

gsub('[[:digit:]]+', '', a)

Solution

  • You may use

    gsub("\\s*\\b[A-Za-z]{2}\\d{2}(?!\\d)\\w*\\b", "", a, perl=TRUE)
    

    See the regex demo. An alternative that has no lookahead and therefore does not need perl=TRUE:

    gsub("\\s*\\b[A-Za-z]{2}\\d{2}[A-Za-z_]\\w*\\b", "", a)
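
    The two patterns are not fully equivalent: the alternative requires a letter or underscore right after the two digits, so a word that ends after the two digits is left alone. A minimal sketch comparing them on single words (the word "ZZ10" is made up for illustration and is not from the question):

    ```r
    # Both patterns without the leading \\s*, so they can be tested per word
    lookahead   <- "\\b[A-Za-z]{2}\\d{2}(?!\\d)\\w*\\b"
    alternative <- "\\b[A-Za-z]{2}\\d{2}[A-Za-z_]\\w*\\b"

    words <- c("ZZ10ASDJN123", "ZZ100DD22", "ZZ10")

    # The lookahead version needs the PCRE engine
    res_lookahead <- grepl(lookahead, words, perl = TRUE)
    # The alternative works with R's default regex engine
    res_alternative <- grepl(alternative, words)

    print(res_lookahead)
    ## => [1]  TRUE FALSE  TRUE
    print(res_alternative)
    ## => [1]  TRUE FALSE FALSE
    ```

    If bare words such as "ZZ10" should also be removed, the lookahead version is the one to use.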
    

    Details

    • \s* - 0 or more whitespace chars
    • \b - a word boundary
    • [A-Za-z]{2} - two ASCII letters (use \p{L} to match any Unicode letters)
    • \d{2} - two digits
    • (?!\d) - there can be no digit immediately to the right
    • \w* - 0 or more letters, digits or underscores
    • \b - word boundary.
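
    To see what the pattern described above actually consumes, the matches can be extracted instead of replaced. This is a small sketch using base R's regmatches() and gregexpr(); the extra test words "Z10AB" and "ABC12" are made up for illustration:

    ```r
    a <- "it is ZZ10ASDJN123 and ZZ100DD22"
    pattern <- "\\s*\\b[A-Za-z]{2}\\d{2}(?!\\d)\\w*\\b"

    # Extract every match; note the leading space captured by \\s*
    hits <- regmatches(a, gregexpr(pattern, a, perl = TRUE))[[1]]
    print(hits)
    ## => [1] " ZZ10ASDJN123"

    # Words with one or three leading letters do not match
    other <- grepl(pattern, c("Z10AB", "ABC12"), perl = TRUE)
    print(other)
    ## => [1] FALSE FALSE
    ```

    "ZZ100DD22" is not among the hits because the third digit makes the (?!\d) lookahead fail.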

    Add (*UCP) at the start of the regex (this requires perl=TRUE) to make it fully Unicode-aware.
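
    A hedged sketch of the Unicode-aware variant, combining (*UCP) with \p{L}. The sample string is made up for illustration, and the sketch assumes that the PCRE library R was built with supports Unicode properties:

    ```r
    # "\u00c4\u00d6" is "ÄÖ", written with escapes to avoid source-encoding issues
    b <- "it is \u00c4\u00d610abc and ZZ100DD22"

    # (*UCP) makes \\w, \\d and \\b Unicode-aware; \\p{L} matches any Unicode letter
    ucp_pattern <- "(*UCP)\\s*\\b\\p{L}{2}\\d{2}(?!\\d)\\w*\\b"
    res_ucp <- gsub(ucp_pattern, "", b, perl = TRUE)
    print(res_ucp)
    ```

    With [A-Za-z]{2} in place of \p{L}{2}, the word starting with the two non-ASCII letters would not be matched.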

    R demo:

    a <- c("it is ZZ10ASDJN123 and ZZ100DD22")
    gsub("\\s*\\b[A-Za-z]{2}\\d{2}(?!\\d)\\w*", "", a, perl=TRUE)
    ## => [1] "it is and ZZ100DD22"
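
    Because \s* only consumes whitespace before the word, a match at the very start of a string leaves a leading space behind. One possible way to handle that is to wrap the call and trim the result; remove_codes is a hypothetical helper name, and the second and third inputs are made up for illustration:

    ```r
    # Hypothetical helper: remove the matching words, then trim stray whitespace
    remove_codes <- function(x) {
      out <- gsub("\\s*\\b[A-Za-z]{2}\\d{2}(?!\\d)\\w*\\b", "", x, perl = TRUE)
      trimws(out)
    }

    res <- remove_codes(c("it is ZZ10ASDJN123 and ZZ100DD22",
                          "ZZ10AB is first",
                          "AB12 CD34"))
    print(res)
    ## => [1] "it is and ZZ100DD22" "is first"            ""
    ```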