python google-docs google-docs-api

How to pull headings from a Google document using the API


I am currently trying to create a Python script that will check a Google document for various on-page SEO metrics.

The Google Docs API has a good sample showing how to extract ALL the text from a Google document. However, this simply returns plain text with no formatting.

To perform my checks I need to be able to split out the H1, the H2-H4, the text in bold, etc. After two hours of experimenting and searching the API docs and the web, I can't figure out how to edit the following loop to get (for example) all the HEADING_2 elements.

    text = ''
    for value in elements:
        if 'paragraph' in value:
            elements = value.get('paragraph').get('elements')
            for elem in elements:
                text += read_paragraph_element(elem)
        elif 'table' in value:
            # The text in table cells are in nested Structural Elements and tables may be
            # nested.
            table = value.get('table')
            for row in table.get('tableRows'):
                cells = row.get('tableCells')
                for cell in cells:
                    text += read_strucutural_elements(cell.get('content'))
        elif 'tableOfContents' in value:
            # The text in the TOC is also in a Structural Element.
            toc = value.get('tableOfContents')
            text += read_strucutural_elements(toc.get('content'))
    return text

Any help appreciated. Thanks.


Solution

  • I believe your goal and current situation are as follows.

    • You want to retrieve the text of the paragraphs whose paragraph style is HEADING_2.
    • You want to achieve this using googleapis for Python.
    • You want to achieve this by modifying the script in your question.
    • You have already been able to retrieve the values from the Google Document using the Docs API.

    Modification point:

    • In this case, I think the text should be retrieved only when the paragraph's namedStyleType (under paragraphStyle) is HEADING_2.

    When this is applied to your script, it becomes as follows.

    Modified script:

    From:
    for value in elements:
        if 'paragraph' in value:
            elements = value.get('paragraph').get('elements')
    
    To:
    for value in elements:
        if 'paragraph' in value and value['paragraph']['paragraphStyle']['namedStyleType'] == 'HEADING_2':  # Modified
            elements = value.get('paragraph').get('elements')
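
The same check can be generalized to the other metrics mentioned in the question (H1, H2-H4, bold text). The following is a minimal sketch, not a definitive implementation: the helper names (`paragraph_text`, `headings_by_style`, `bold_runs`, `_para`) and `SAMPLE_CONTENT` are hypothetical, and the sample is a hand-built stand-in for `document['body']['content']` that contains only the fields read here. It assumes bold text is flagged on each run's `textStyle`; text that appears bold only because of its named style may not be reported this way.

```python
from collections import defaultdict


def paragraph_text(paragraph):
    """Join the text of every textRun in one paragraph."""
    return "".join(
        elem.get("textRun", {}).get("content", "")
        for elem in paragraph.get("elements", [])
    )


def headings_by_style(content):
    """Map each HEADING_* style name to the heading texts found in `content`."""
    found = defaultdict(list)
    for value in content:
        paragraph = value.get("paragraph")
        if paragraph is None:
            continue  # section breaks, tables and the TOC are skipped in this sketch
        style = paragraph.get("paragraphStyle", {}).get("namedStyleType", "")
        if style.startswith("HEADING_"):
            found[style].append(paragraph_text(paragraph).strip())
    return dict(found)


def bold_runs(content):
    """Return the text of every run whose textStyle explicitly sets bold."""
    runs = []
    for value in content:
        for elem in value.get("paragraph", {}).get("elements", []):
            run = elem.get("textRun", {})
            if run.get("textStyle", {}).get("bold"):
                runs.append(run.get("content", "").strip())
    return runs


def _para(style, *runs):
    """Build a paragraph element shaped like the API response (sample data only)."""
    return {
        "paragraph": {
            "paragraphStyle": {"namedStyleType": style},
            "elements": [
                {"textRun": {"content": text,
                             "textStyle": {"bold": True} if bold else {}}}
                for text, bold in runs
            ],
        }
    }


# Hypothetical sample standing in for document['body']['content'].
SAMPLE_CONTENT = [
    {"sectionBreak": {}},
    _para("HEADING_1", ("Page title\n", False)),
    _para("NORMAL_TEXT", ("Intro with a ", False), ("bold phrase", True), (".\n", False)),
    _para("HEADING_2", ("First section\n", False)),
    _para("HEADING_2", ("Second section\n", False)),
]

headings = headings_by_style(SAMPLE_CONTENT)
bold = bold_runs(SAMPLE_CONTENT)
print(headings)
print(bold)
```

With a real document, the list passed in would be the `content` of the body you already retrieve. Like the one-line change above, this sketch only looks at top-level paragraphs, so headings inside tables would need the same recursion as the original loop.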
    

    Reference: