I'm currently trying to create a Python script that will check a Google document for various on-page SEO metrics.
The Google Docs API has a good sample showing how to extract ALL the text from a Google document. However, this simply returns plain text with no formatting.
To perform my checks I need to split out the H1, the H2-H4 headings, text in bold and so on. After two hours of experimenting and searching the API docs and the web, I can't figure out how to edit the following loop to get (for example) all the HEADING_2 elements.
text = ''
for value in elements:
    if 'paragraph' in value:
        elements = value.get('paragraph').get('elements')
        for elem in elements:
            text += read_paragraph_element(elem)
    elif 'table' in value:
        # The text in table cells are in nested Structural Elements and tables may be
        # nested.
        table = value.get('table')
        for row in table.get('tableRows'):
            cells = row.get('tableCells')
            for cell in cells:
                text += read_strucutural_elements(cell.get('content'))
    elif 'tableOfContents' in value:
        # The text in the TOC is also in a Structural Element.
        toc = value.get('tableOfContents')
        text += read_strucutural_elements(toc.get('content'))
return text
Any help appreciated. Thanks.
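I believe your goal and your current situation are as follows: you want to retrieve only the text of the HEADING_2 paragraphs from the Google document.
In this case, the text needs to be retrieved only when the namedStyleType of the paragraph's paragraphStyle is HEADING_2. When this is reflected in your script, it becomes as follows.
From: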
for value in elements:
    if 'paragraph' in value:
        elements = value.get('paragraph').get('elements')
To:
for value in elements:
    if 'paragraph' in value and value['paragraph']['paragraphStyle']['namedStyleType'] == 'HEADING_2':  # Modified
        elements = value.get('paragraph').get('elements')
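If you also need the H1, the H3-H4 headings and the bold text, the same idea can be generalized. The following is only a minimal sketch, not part of the official sample: the helper names paragraph_text, group_by_named_style and bold_runs are hypothetical, and sample_content is hand-written data shaped like document['body']['content'] rather than real API output. It assumes a paragraph without a namedStyleType can be treated as NORMAL_TEXT, it skips tables and the table of contents, and it only detects bold that is set on the text run itself.

```python
from collections import defaultdict


def paragraph_text(paragraph):
    """Concatenate the content of every text run in one paragraph."""
    return ''.join(
        elem.get('textRun', {}).get('content', '')
        for elem in paragraph.get('elements', [])
    )


def group_by_named_style(content):
    """Map each namedStyleType to the texts of the paragraphs that use it."""
    groups = defaultdict(list)
    for value in content:
        paragraph = value.get('paragraph')
        if paragraph is None:
            continue  # tables, the TOC and section breaks are skipped here
        # Assumption: fall back to NORMAL_TEXT when no named style is present.
        style = paragraph.get('paragraphStyle', {}).get(
            'namedStyleType', 'NORMAL_TEXT')
        text = paragraph_text(paragraph).strip()
        if text:
            groups[style].append(text)
    return dict(groups)


def bold_runs(content):
    """Return the text of every run whose own textStyle has bold set."""
    runs = []
    for value in content:
        for elem in value.get('paragraph', {}).get('elements', []):
            run = elem.get('textRun', {})
            if run.get('textStyle', {}).get('bold'):
                runs.append(run.get('content', '').strip())
    return runs


# Hand-written sample shaped like document['body']['content'].
sample_content = [
    {'sectionBreak': {}},
    {'paragraph': {
        'paragraphStyle': {'namedStyleType': 'HEADING_1'},
        'elements': [{'textRun': {'content': 'Page title\n', 'textStyle': {}}}],
    }},
    {'paragraph': {
        'paragraphStyle': {'namedStyleType': 'HEADING_2'},
        'elements': [{'textRun': {'content': 'First section\n', 'textStyle': {}}}],
    }},
    {'paragraph': {
        'paragraphStyle': {'namedStyleType': 'NORMAL_TEXT'},
        'elements': [
            {'textRun': {'content': 'Some ', 'textStyle': {}}},
            {'textRun': {'content': 'bold', 'textStyle': {'bold': True}}},
            {'textRun': {'content': ' text.\n', 'textStyle': {}}},
        ],
    }},
]

groups = group_by_named_style(sample_content)
bold = bold_runs(sample_content)
```

With a real document, you would pass document.get('body').get('content') instead of sample_content, then read groups.get('HEADING_1'), groups.get('HEADING_2') and so on for the individual checks.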