Search code examples
regexms-wordpage-numbering

MS Word + VBA + RegExp: Get Page Number of Match


Is that possible? Probably not? How can I then find all exact occurrences of a match and the according page numbers?

EDIT:

I have the regex working properly. What I need is for each match to get all the pages it appears on.

Example:

regex = \b\d{3}\b

123 appears on page 1,4,20
243 appear on page 3,5,7
523 appears on page 9

How can I get that information (all the pages a match occurs on?)

This is for creating some kind of index automatically.

EDIT 2:

I got a basic working version, snippet:

Set Matches = regExp.Execute(ActiveDocument.range.Text)

For Each Match In Matches    
    Set range = ActiveDocument.range(Match.FirstIndex, Match.FirstIndex + Len(Match.Value))    
    page = range.Information(wdActiveEndAdjustedPageNumber)

The problem is that Match.FirstIndex does not always point to the first character of the match in ActiveDocument.range. Word tables mess this up as ActiveDocument.range.Text contains characters that are not on the text put represent something in the table.


Solution

  • This turned out to be rather complex and I can't say if my solution works for any document. The main issue is as indicated in the Question, that RegexMatch.FirstIndex can not be used to determine were the actually Match is within the MS Word Document. This is due to the fact that regex matching is done on range.Text property (String) and that string just contains different amount of characters than the range object does and hence Indexes don't match.

    So my solution is for each match, I do a Find in the whole document for that match. the find methods gives a Range object from which the correct page can be determined.

    In my special case a match could be the same thing also different value. Example: 343in my case would be the same as Prefix-343. A second issue was that the matches must be sorted eg 123before 324regardless which one occurs first in the document.

    If you require the Sort Functionality you will also need the following to "modules":

    SortDictionary Function:

    http://www.cpearson.com/excel/CollectionsAndDictionaries.htm

    Module "modQSortInPlace":

    http://www.cpearson.com/Zips/modQSortInPlace.zip

    If no sort is needed you don't need them but you need to remove the according function call SortDictionary Dict, Truefrom my code.

    Now to my code. Soem parts you can remove, especially the formatting one. This is specific to my case. Also if your match is "unique", eg. not prefix or so you can simplify the code too. You will need to reference the "Microsoft Scripting Library".

    Option Explicit
    
    Sub ExtractRNumbers()
    
        Dim Dict As Scripting.Dictionary
        Set Dict = CreateObject("Scripting.dictionary")
    
        Dim regExp, Match, Matches
        Dim rNumber As String
        Dim range As range
    
        Set regExp = CreateObject("VBScript.RegExp")
        regExp.Pattern = "\b(R-)?\d{2}-\d{4,5}(-\d)?\b"
        regExp.IgnoreCase = False
        regExp.Global = True
    
        ' determine main section, only extract R-Numbers from main section
        ' and not the Table of contents as example
        ' main section = section with most characters
    
        Dim section As section
        Dim maxSectionSize As Long
        Dim sectionSize As Long
        Dim sectionIndex As Integer
        Dim currentIndex As Integer
        maxSectionSize = 0
        currentIndex = 1
        For Each section In ActiveDocument.Sections
            sectionSize = Len(section.range.text)
            If sectionSize > maxSectionSize Then
                maxSectionSize = sectionSize
                sectionIndex = currentIndex
            End If
            currentIndex = currentIndex + 1
        Next
    
    
        Set Matches = regExp.Execute(ActiveDocument.Sections(sectionIndex).range.text)
    
    
        For Each Match In Matches
    
            ' If the Document contains Tables, ActiveDocument.range.Text will contain
            ' BEL charachters (chr(7)) that probably define the table structure. The issue
            ' is that then Match.FirstIndex does not point to the actual first charachter
            ' of a Match in the Document.
            ' Also there are other things (unknwon) that lead to the same issue, eg.
            ' Match.FirstIndex can not be used to find the actual "matching word" within the
            ' document. Because of that below commented apporach does not work on a generic document
    
            '   Set range = ActiveDocument.range(Match.FirstIndex, Match.FirstIndex + Len(Match.Value))
            '   page = range.Information(wdActiveEndAdjustedPageNumber)
    
            ' Maybe there is a simpler solution but this works more or less
            ' the exception beign tables again. see http://support.microsoft.com/kb/274003
    
            ' After a match is found the whole document is searched using the find method.
            ' For each find result the page number is put into an array (if it is not in the array yet)
            ' Then the match is formatted properly.
            ' After formatting, it is checked if the match was previously already found
            '
            '   If not, we add a new entry to the dictionary (key = formatted match, value = array of page numbers)
            '
            '   If match was already found before (but potentially in a different format! eg R-87-1000 vs 87-1000 as example),
            '   all additional pages are added to the already found pages.
    
            Set range = ActiveDocument.Sections(sectionIndex).range
            With range.Find
                .text = Match.Value
                .MatchWholeWord = True
                .MatchCase = True
                .Wrap = wdFindStop
            End With
    
            Dim page As Variant
            Dim pages() As Integer
            Dim index As Integer
            index = 0
            ReDim pages(0)
    
            Do While range.Find.Execute() = True
                page = range.Information(wdActiveEndAdjustedPageNumber)
                If Not IsInArray(page, pages) Then
                    ReDim Preserve pages(index)
                    pages(index) = page
                    index = index + 1
                End If
            Loop
    
            ' FORMAT TO PROPER R-NUMBER: This is specific to my case
            rNumber = Match.Value
            If Not rNumber Like "R-*" Then
             rNumber = "R-" & rNumber
            End If
            ' remove possible batch number as r-number
            If Len(rNumber) > 11 Then
                rNumber = Left(rNumber, Len(rNumber) - 2)
            End If
            ' END FORMAT
    
            If Not Dict.Exists(rNumber) Then
                Dict.Add rNumber, pages
            Else
                Dim existingPages() As Integer
                existingPages = Dict(rNumber)
                For Each page In pages
                    If Not IsInArray(page, existingPages) Then
                        ' add additonal pages. this means that the previous match
                        ' was formatted different, eg R-87-1000 vs 87-1000 as example
                        ReDim Preserve existingPages(UBound(existingPages) + 1)
                        existingPages(UBound(existingPages)) = page
                        Dict(rNumber) = existingPages
                    End If
                Next
            End If
    
        Next
        'sort dictionary by key (R-Number)
        SortDictionary Dict, True
        Dim fso
        Set fso = CreateObject("Scripting.FileSystemObject")
    
        Dim stream
        ' Create a TextStream.
        Set stream = fso.CreateTextFile(ActiveDocument.Path & "\" & ActiveDocument.Name & "-rNumbers.txt", True)
    
        Dim key As Variant
        Dim output As String
        Dim i As Integer
        For Each key In Dict.Keys()
            output = key & vbTab
            pages = Dict(key)
            For i = LBound(pages) To UBound(pages)
                output = output & pages(i) & ", "
            Next
            output = Left(output, Len(output) - 2)
            stream.WriteLine output        
        Next
        Set Dict = Nothing
        stream.Close
    End Sub
    
    Private Function IsInArray(page As Variant, pages As Variant) As Boolean
        Dim i As Integer
        IsInArray = False
        For i = LBound(pages) To UBound(pages)
            If pages(i) = page Then
                IsInArray = True
                Exit For
            End If
        Next
    End Function