BACKGROUND
I want to replace every word that consists only of letters and digits, and contains at least one letter and at least one digit, with whitespace. I am using VBA as shown in the example below.
See the proposed solution here: https://stackoverflow.com/a/7684859
QUESTION
Why does the regexp match the word "WhyIsThisWordMatched" when it doesn't contain a digit? And how can the regexp be fixed so it only matches words that contain both letters and digits, and only letters and digits?
Public Sub TestMe()
    Dim Rx As Object
    Dim Txt As String
    Set Rx = CreateObject("VBScript.RegExp")
    Rx.Global = True
    Rx.Pattern = "(^|\s)(?=.*[0-9])(?=.*[a-zA-Z])([a-zA-Z0-9]+)($|\s)"
    Txt = "WhyIsThisWordMatched XXX-111"
    Txt = Rx.Replace(Txt, " ")
    Debug.Print "Result: " & Txt
    ' Prints the string " XXX-111"
End Sub
ANSWER
Lookaheads do not consume characters; they only assert that a match is possible from the current position. The pattern (?=.*[0-9]) ensures that there is a digit somewhere ahead, and (?=.*[a-zA-Z]) ensures that there is a letter somewhere ahead. Because .* can run past whitespace, neither check is confined to the current word.
The current pattern (^|\s)(?=.*[0-9])(?=.*[a-zA-Z])([a-zA-Z0-9]+)($|\s) therefore matches any word of letters and digits that is delimited by whitespace or the start or end of the string, as long as a digit and a letter each appear somewhere from that point on in the string (not necessarily in the same word). In your example, "WhyIsThisWordMatched" matches because the digit lookahead finds the 1 in "XXX-111".
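To see the scope of the lookahead in isolation, here is a minimal sketch. The sub name ShowLookaheadScope and the two simplified patterns are made up for illustration and are not part of your code. It compares a lookahead built on .* with one built on \w*, which cannot cross the space between the two words.

```vba
Public Sub ShowLookaheadScope()
    Dim Rx As Object
    Set Rx = CreateObject("VBScript.RegExp")

    ' .* can run past the space, so the lookahead finds a digit in "XXX-111"
    Rx.Pattern = "^(?=.*[0-9])[a-zA-Z]+"
    Debug.Print Rx.Test("WhyIsThisWordMatched XXX-111")   ' True

    ' \w* stops at the first non-word character, so the check stays inside the first word
    Rx.Pattern = "^(?=\w*[0-9])[a-zA-Z]+"
    Debug.Print Rx.Test("WhyIsThisWordMatched XXX-111")   ' False
End Sub
```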
Restricting the lookaheads to word characters (\w* instead of .*) should resolve the immediate issue:
(^|\s)(?=\w*[0-9])(?=\w*[a-zA-Z])[a-zA-Z0-9]+($|\s)
So your code would be something like:
Public Sub TestMe()
    Dim Rx As Object
    Dim Txt As String
    Set Rx = CreateObject("VBScript.RegExp")
    Rx.Global = True
    Rx.Pattern = "(^|\s)(?=\w*[0-9])(?=\w*[a-zA-Z])[a-zA-Z0-9]+($|\s)"
    Txt = "WhyIsThisWordMatched XXX-111 abc123"
    Txt = Rx.Replace(Txt, " ")
    Debug.Print "Result: " & Txt
    ' Prints the string "WhyIsThisWordMatched XXX-111 "
End Sub
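One thing to watch for: the trailing ($|\s) consumes the separator, so when two qualifying words sit next to each other the second one no longer has a leading whitespace character to match and may be skipped. A possible workaround, assuming it is fine to replace each word together with its leading whitespace only, is to assert the trailing delimiter with a lookahead instead of consuming it. The sub name and the test string below are made up for illustration.

```vba
Public Sub TestConsecutiveWords()
    Dim Rx As Object
    Dim Txt As String
    Set Rx = CreateObject("VBScript.RegExp")
    Rx.Global = True
    ' The trailing delimiter is only asserted, not consumed,
    ' so the same space can serve as the leading \s of the next match
    Rx.Pattern = "(^|\s)(?=\w*[0-9])(?=\w*[a-zA-Z])[a-zA-Z0-9]+(?=\s|$)"
    Txt = "abc123 def456 XXX-111"
    Txt = Rx.Replace(Txt, " ")
    Debug.Print "Result: [" & Txt & "]"
End Sub
```

With this variant both abc123 and def456 should be replaced and XXX-111 left alone, whereas the pattern ending in ($|\s) would replace abc123 and skip def456.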
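In a regex engine that supports lookbehind, you could instead use negative lookarounds, which assert that the word is delimited by whitespace or the start or end of the string without consuming the delimiters. Note that VBScript.RegExp does not support lookbehind, so this variant will not work in the VBA code above. The word itself is still restricted to [a-zA-Z0-9]+, because \S+ would also accept words such as "XXX-111":
(?<!\S)(?=\S*[0-9])(?=\S*[a-zA-Z])[a-zA-Z0-9]+(?!\S)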
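If word boundaries are acceptable as delimiters, a shorter pattern that VBScript.RegExp does support is the one below. Be aware that \b also treats punctuation as a boundary, so the "a1" in "foo-a1" would match, and that \w includes the underscore, so "a_1" would match too: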
\b(?=\w*[0-9])(?=\w*[a-zA-Z])\w+\b
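A short sketch of the word-boundary variant in VBA, mainly to make its limits visible. The sub name and the sample string are assumptions chosen for illustration.

```vba
Public Sub TestWordBoundary()
    Dim Rx As Object
    Dim Txt As String
    Set Rx = CreateObject("VBScript.RegExp")
    Rx.Global = True
    ' \b needs no surrounding whitespace, so no delimiter is consumed
    Rx.Pattern = "\b(?=\w*[0-9])(?=\w*[a-zA-Z])\w+\b"
    Txt = "abc123 XXX-111 foo-a1"
    Txt = Rx.Replace(Txt, " ")
    Debug.Print "Result: [" & Txt & "]"
End Sub
```

Here abc123 should be replaced and XXX-111 left alone, because neither "XXX" nor "111" contains both a letter and a digit. The "a1" after the hyphen should be replaced as well, which may or may not be what you want.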