Search code examples
arraysexcelregexvbapowerquery

Extract matched keyword and its matching line or paragraph as regex submatches


The following code snippet loops through single-column tables containing keywords, in a keywords sheet and tries matching these concatenated keywords of each column, to a text string (Str) that has been concatenated from an array containing verbatims.

I need a regex pattern that can extract not only the keyword as a submatch, but the entire line/paragraph in which the keyword match was found.

    '========================= Grab keyword tables for matching ======================================

    Set dRegex = CreateObject("Scripting.Dictionary")
    Set dDupl = CreateObject("Scripting.Dictionary")
    
    ' Keywords Sheet
    With wkKeywords
        ' find last col on header row 1
        x = .Cells(1, .Columns.Count).End(xlToLeft).Column
        ' scan across header
        For c = 1 To x
            colname = Trim(.Cells(1, c))
            If Len(colname) > 0 Then
                ' last row
                n = .Cells(.Rows.Count, c).End(xlUp).Row
                
                ReDim ar(0 To n - 2)
                'modify for regex
                For i = 2 To n
                    s = Trim(.Cells(i, c))
                    
                     If Len(s) = 0 Then
                        MsgBox "Nothing in keyword row " & i & " column " & c, vbCritical
                        Exit Sub
                    End If
                    ar(i - 2) = Replace(s, "'", "'?") ' optional
                    ar(i - 2) = Replace(ar(i - 2), """", """?")
                Next i
                
                Set dRegex(colname) = CreateObject("VBScript.RegExp")
                With dRegex(colname)
                    .Global = True
                    .MultiLine = False
                    .ignorecase = True
'                    .Pattern = "(\b" & Join(ar, "\b)|(\b") & "\b)"
'                    .Pattern = "((.*?\b" & Join(ar, "\b|\b") & ".*?)(?=\b|$))"
'                    .Pattern = "(\b(" & Join(ar, "|") & ")\b)(.*?(\r\n|\r|\n|$))"
'                    .Pattern = "((.*?)(\b" & Join(ar, "\b|\b") & "\b)(.*?))" '(\r\n|\r|\n|$))"
                    .Pattern = "((?:.*(?:\r\n|\r|\n|$))?(\b" & Join(ar, "\b|\b") & "\b)(?:.*(?:\r\n|\r|\n|$))?)"

                    'Debug.Print colname, n, .Pattern
                End With
            End If
        Next c
    End With
        
    '========================= Transpose data Array ======================================
    tempArr = ArrOut
    Erase ArrOut
    ArrOut = TransposeArray(tempArr)
    
    '========================= Join textual data as a concatenated string for matching ======================================
    
    For i = LBound(ArrOut, 1) To UBound(ArrOut, 1)
        str = str & ArrOut(i, 1) & ":" & ArrOut(i, 2) & vbCr
    Next i
        
Stop

    '========================= Find Matches in textual data ======================================
    For Each colname In dRegex.keys
        Set matches = dRegex(colname).Execute(str)
 '       DoEvents
        Stop
        For i = 0 To (matches.Count - 1)
            For j = 0 To matches(i).submatches.Count - 1
                If Not matches(i).submatches(j) = Empty Then
                    Debug.Print colname, Trim(matches(i).submatches(j))
                End If
            Next j
            Stop
        Next i
        Stop
    Next colname

Stop

I have tried a couple of regex patterns and some of them return just the keyword, or the keyword and a part of the paragraph after the keyword or entire paragraphs before and after the keyword.

I just need the entire line or paragraph returned in which the keyword was matched. So a keyword say "eager" could be found in multiple different lines/paragraphs. I want to list the keyword and its matching paragraphs.

The lines/paragraphs in the string could like :

tody:Never really considered what it could do. But now the question is across as far as lack of sleep could affect how your body generates. I am eager to know much more. anabelle:Understand it's something to do with the body clock. That's all I know really, you can feel differently in different times of day, stuff like that. tody:I'd like to know if this lotion is long-lasting, and how much it would be and how many times you would have to apply it? I know it is said that it would take up the six or eight weeks, but how many times would you need to apply it? simran:I think this lotion could be different, because it works with your rhythm, and when you go to sleep, so it could have a more lasting effect, and since you put it on your face at the end of the night instead of the morning when you wake up.

Some keywords in one of the keywords columns to be matched in the paragraphs, could be:

affect
eager
feel differently
long-lasting
different
lasting effect

Please note, i don't want to loop the textual data array ArrOut by splitting it into individual lines, but want to do it in one go without much looping.

I am not a pro in regex expressions. Please let me know if you need anything else.


Solution

  • Following from my comment about extracting the matched paragraph if you know the match position within the text:

    Sub GetMatchedParagraph()
    
        Dim txt As String, pos As Long, posB As Long, posE As Long
        
        txt = "This is the first paragraph." & vbLf & _
                "This is the second paragraph." & vbLf & _
                "This is the third paragraph." & vbLf & _
                "This is the fourth paragraph." & vbLf & _
                "This is the fifth paragraph."
        
        pos = InStr(1, txt, "second", vbTextCompare) 'find "second"
        If pos = 0 Then Exit Sub
        
        
        'previous newline, or start of text if none
        posB = InStrRev(txt, vbLf, pos)
        If posB = 0 Then posB = 1
        
        'next newline, or end of text if none
        posE = InStr(pos, txt, vbLf)
        If posE = 0 Then posB = Len(txt)
        
        Debug.Print Mid(txt, posB, (posE - posB)) 'the matched paragraph
    
    End Sub