Search code examples
vbams-wordvsto

Use regular expressions to modify all headings in a Word document


I am using Office 2013 and I want to use regular expressions to modify all headings in a Word document by removing the part before the heading that consists of numbers, spaces, and punctuation. For example, "Heading 1.1.1.*.Title1.1.1" should become "*.Title1.1.1" after processing. I also need to keep the heading levels unchanged.

I have used the following program, but it is causing an infinite loop. I also tried putting the headings into a list and using Find with wildcards (<[0-9\s.,]*), but I did not get the correct results. What is wrong, and how should I fix it?

Here is the code:

Sub CleanTitlesUsingRegex()

    Dim para As Paragraph
    Dim title As String
    Dim regex As Object
    
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = "^[\d\s.,]+"
        .Global = True
    End With
    
    For Each para In ActiveDocument.Paragraphs
        If para.Style Like "Heading*" Then
            title = para.Range.Text
            para.Range.Text = regex.Replace(title, "")
        End If
    Next para

End Sub


Sub CleanTitlesUsingFind()

    Dim myfind As find
    Set myfind = ActiveDocument.Content.find
    Dim titleStyles As Variant
    titleStyles = Array("Heading 1", "Heading 2", "Heading 3") 
    Dim styleName As Variant

    For Each styleName In titleStyles
    With myfind
        .ClearFormatting
        .text = "<[0-9\s.,、]*"  
        .Replacement.text = "" 
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchWildcards = True
        .MatchSoundsLike = False
        .MatchAllWordForms = False

        .style = styleName
        .Execute Replace:=wdReplaceAll
    End With
Next
End Sub

Solution

  • The infinite loop results from the fact that you are replacing the entire range of the paragraph, including the paragraph mark. That seems to be enough to cause Word's For Each to "get lost" so the code repeatedly processes the same paragraph.

    This type of problem is common when working with the Word object model. A traditional solution is to avoid For Each and process the paragraphs in reverse order, using something more like this:

    Dim i As Long
    For i = ActiveDocument.Paragraphs.Count To 1 Step -1
      With ActiveDocument.Paragraphs(i)
      ' do your thing
      End With
    Next
    

    Or here, you can probably modify the Range to exclude the paragraph mark and do something like this:

    Dim rng As Object ' Word.Range
    For Each para In ActiveDocument.Paragraphs
      If para.Style Like "Heading*" Then
        Set rng = para.Range
        rng.End = rng.End - 1
        title = rng.Text
        rng.Text = regex.Replace(title, "")
      End If
    Next
    

    Also, I'm not sure what you're actually trying to match because if you have Heading 1.1.1*Something 1.1.1 and you just want to remove the digits/./,/spaces immediately before the * then I think you need a Patterm nore like [\d\s.,]+\*. (Although that would remove the *).

    It may well be possible to do the same thing without VBScript.RegExp using a Word Wildcard replace, but I think you need to clarify exactly what you need to search for and replace. (Plus, I am sure there are other people here who are better at Regexp and Word Find/Replace patterns than me).