Search code examples
excelregexvbafunctionlookbehind

VBA regex: extract multiple strings between strings within Excel cell with custom function


Within an Excel column I have data such as:

"Audi (ADI), Mercedes (modelx) (MEX), Ferrari super fast, high PS (FEH)"

There hundreds of models that are described by a name and an abbreviation of three capitalized letters in brackets.

I need to extract the names only and the abbreviations only to separate cells. I succeeded doing this for the abbreviations by the following module:

Function extrABR(cellRef) As String
    Dim RE As Object, MC As Object, M As Object
    Dim sTemp As Variant
    Const sPat As String = "([A-Z][A-Z][A-Z][A-Z]?)"  ' this is my regex to match my string
    
    
Set RE = CreateObject("vbscript.regexp")
With RE
    .Global = True
    .MultiLine = True
    .Pattern = sPat
    If .Test(cellRef) Then
        Set MC = .Execute(cellRef)
        For Each M In MC
            sTemp = sTemp & ", " & M.SubMatches(0)
        Next M
    End If
End With

extrABR = Mid(sTemp, 3)

End Function 

However, I do not manage to do so for names. I thought of just exchanging the regex by the following regex: (^(.*?)(?= \([A-Z][A-Z][A-Z])|(?<=, )(.*)(?= \([A-Z][A-Z][A-Z])), but VBA does not seem to allow lookbehind.

Any idea?


Solution

  • Correct, lookbehinds are not supported, but they are only necessary when your expected matches overlap. It is not the case here, all your matches are non-overlapping. So, you can again rely on capturing:

    (?:^|,)\s*(.*?)(?=\s*\([A-Z]{3,}\))
    

    See the regex demo. Group 1 values are accessed via .Submatches(0).

    Details:

    • (?:^|,) - either start of a string or a comma
    • \s* - zero or more whitespace chars
    • (.*?) - Capturing group 1: any zero or more chars other than line break chars as few as possible
    • (?=\s*\([A-Z]{3,}\)) - a positive lookahead that matches a location that is immediately followed with
      • \s* - zero or more whitespace chars
      • \( - a ( char
      • [A-Z]{3,} - three or more uppercase chars
      • \) - a ) char.

    Demo screenshot:

    enter image description here