Converting bold text within a .doc to marked-up text programmatically


I am currently dealing with a large .docx file (roughly 400 pages). It is divided into sections that are easy for humans to digest and look like this:

Bold text

Written paragraph

This is perfectly readable for humans. Unfortunately, an in-house program at our university relies on the markup of .docx files to sort and process them. In other words, sectioning a .doc/.docx with bold formatting alone is not enough; you must use the built-in heading styles in MS Office (as below):

[Image: the MS Office menu where you can highlight a piece of text and set it to Heading 1, Heading 2, etc.]

So what I need to write is a simple script that finds the bold text within a .docx document and converts it to properly marked-up "Heading 1"s, or similar. It doesn't matter to me whether the .docx file format is preserved.

Is it possible to do this? What APIs/languages/tools should I start looking into to accomplish this relatively simple task?


Solution

  • Using a short VBA macro, you can iterate over all paragraphs and change the style of every paragraph that contains only bold text to a heading style:

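    ' Promote every paragraph that is entirely bold to the built-in Heading 1 style.
    Sub FormatBoldAsHeading()

        Dim p As Paragraph

        For Each p In ActiveDocument.Paragraphs
            ' Font.Bold is True only when the whole range is bold;
            ' mixed formatting yields wdUndefined and plain text yields False.
            If p.Range.Font.Bold = True Then
                p.Style = WdBuiltinStyle.wdStyleHeading1
            End If
        Next p

    End Sub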
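Before running the macro on a 400-page file, it may help to preview which paragraphs it would promote. The following is a minimal read-only sketch in Python, using only the standard library. It relies on a .docx being a ZIP archive whose main text lives in `word/document.xml`. The function names `bold_only_paragraphs` and `bold_only_paragraphs_in_docx` are made up for this illustration. It only looks at bold applied directly to runs (`w:b`), so bold inherited from a style is not detected, and its results may differ slightly from what Word reports through `Font.Bold`.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _run_text(run):
    """Concatenate the text of all w:t elements inside one run."""
    return "".join(t.text or "" for t in run.iter(W + "t"))


def _run_is_bold(run):
    """True if the run has direct bold formatting (w:b) that is not switched off."""
    rpr = run.find(W + "rPr")
    if rpr is None:
        return False
    b = rpr.find(W + "b")
    if b is None:
        return False
    # <w:b/> without w:val means "on"; explicit 0/false/off means "off"
    return b.get(W + "val", "true") not in ("0", "false", "off")


def bold_only_paragraphs(document_xml):
    """Return the text of every paragraph whose visible runs are all bold.

    Assumption: only direct run formatting is checked, not style inheritance.
    """
    root = ET.fromstring(document_xml)
    found = []
    for p in root.iter(W + "p"):
        runs = list(p.iter(W + "r"))
        # Ignore runs that carry no visible text (empty or whitespace only)
        visible = [r for r in runs if _run_text(r).strip()]
        if visible and all(_run_is_bold(r) for r in visible):
            found.append("".join(_run_text(r) for r in runs).strip())
    return found


def bold_only_paragraphs_in_docx(path):
    """Read word/document.xml from a .docx file and list its bold-only paragraphs."""
    with zipfile.ZipFile(path) as archive:
        return bold_only_paragraphs(archive.read("word/document.xml"))
```

Calling `bold_only_paragraphs_in_docx("thesis.docx")` (the file name is a placeholder) returns a list of candidate heading texts. The sketch deliberately does not write the file back: re-serializing `document.xml` with `xml.etree.ElementTree` can drop namespace declarations that Word expects, so the actual conversion is better left to the macro or to a dedicated library.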