Search code examples
excelregexvbaregex-group

Get every word ending with dot using Regex/VBA


In Office 2019 Excel spreadsheet. I am trying to extract from any specified cell up to 5 words ending with dot after a ].

Sample text:

some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan.

I expect:

ost. ult. lot. sino. collan.

I found this function on the internet:

Public Function RegExtract(Txt As String, Pattern As String) As String

With CreateObject("vbscript.regexp")
    '.Global = True
    .Pattern = Pattern
    If .test(Txt) Then
        RegExtract = .Execute(Txt)(0)
    Else
        RegExtract = "No match found"
    End If
End With

End Function

I call it from an empty cell:

=RegExtract(D2; "([\]])(\s\w+[.]){0,5}")

My expression:

([\]])(\s\w+[.]){0,5}

It returns:

] ost.

  1. I am not able to get rid of the ] which is needed to find the place where my useful bits start inside the text block, since \K does not work in Excel.

  2. I don't understand how iterators work to get "up to 5 occurrences".
    I expected that {0,5} after the second group meant: repeat the previous group until the end of the text block (or until you manage to do it 5 times).

--Added after JdvD accepted answer for the records--

I am using this pattern to get all the words ending with dot, after the first occurrence of the closing bracket.

^.*?\]|(\w+\.\s?)|.

This one (without the question mark) instead gets all the words ending with dot, after the last occurrence of the closing bracket.

^.*\]|(\w+\.\s?)|.

I was missing something in my regExtract function: I needed to store the matches into an array through a for loop and then output this array as a string. I was assuming that the regex engine was storing matches as a unique string.

The working function:

Public Function RegExtract(Txt As String, Pattern As String) As String

Dim rMatch As Object, arrayMatches(), i As Long

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = Pattern
    If .Test(Txt) Then
        For Each rMatch In .Execute(Txt)
            If Not IsEmpty(rMatch.SubMatches(0)) Then
                ReDim Preserve arrayMatches(i)
                arrayMatches(i) = rMatch.SubMatches(0)
                i = i + 1
            End If
        Next
        RegExtract = Join(arrayMatches, " ")
    Else
        RegExtract = "No match found"
    End If
End With

End Function

Solution

  • RegexMatch:

    In addition to the answer given by @RonRosenfeld one could apply what some refer to as 'The Best Regex Trick Ever' which would imply to first match what you don't want and then match what you do want in a capture group. For example:

    ^.*\]|(\w+\.)
    

    See an online demo where in short this means:

    • ^.*\] - Match 0+ (Greedy) characters from the start of the string upto the last occurence of closing square brackets;
    • | - Or;
    • (\w+\.) - Capture group holding 1+ (Greedy) word-characters ending with a dot.

    Here is how it could work in an UDF:

    Sub Test()
    
    Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "
    
    Debug.Print RegExtract(s, "^.*\]|(\w+\.)")
    
    End Sub
    
    '------
    
    'The above Sub would invoke the below function as an example.
    'But you could also invoke this through: `=RegExtract(A1,"^.*\]|(\w+\.)")`
    'on your sheet.
    
    '------
    
    Public Function RegExtract(Txt As String, Pattern As String) As String
    
    Dim rMatch As Object, arrayMatches(), i As Long
    
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = Pattern
        If .Test(Txt) Then
            For Each rMatch In .Execute(Txt)
                If Not IsEmpty(rMatch.SubMatches(0)) Then
                    ReDim Preserve arrayMatches(i)
                    arrayMatches(i) = rMatch.SubMatches(0)
                    i = i + 1
                End If
            Next
            RegExtract = Join(arrayMatches, " ")
        Else
            RegExtract = "No match found"
        End If
    End With
    
    End Function
    

    RegexReplace:

    Depending on your desired output one could also use a replace function. You'd have to match any remaining character with another alternative for that. For example:

    ^.*\]|(\w+\.\s?)|.
    

    See an online demo where in short this means that we added another alternative which is simply any single character. A 2nd small addition is that we added the option of an optional space character \s? in the 2nd alternative.

    Sub Test()
    
    Dim s As String: s = "some text [asred.] ost. |Monday - Ribben (ult.) lot. ac, sino. other maybe long text; collan. "
    
    Debug.Print RegReplace(s, "^.*\]|(\w+\.\s?)|.", "$1")
    
    End Sub
    
    '------
    
    'There are now 3 parameters to parse to the UDF; String, Pattern and Replacement.
    
    '------
    
    Public Function RegReplace(Txt As String, Pattern As String, Replacement) As String
    
    Dim rMatch As Object, arrayMatches(), i As Long
    
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = Pattern
        RegReplace = Trim(.Replace(Txt, Replacement))
    End With
    
    End Function
    

    Note that I used Trim() to remove possible trailing spaces.


    Both RegexMatch and RegexReplace would currently return a single string to clean the input but the former does give you the option to deal with the array in the arrayMatches() variable.