google-api html-parsing google-docs python-docx pypdf

Extracting text from a Google document and getting a particular page

As of now, I export my Google documents by getting the content from this link:

https://docs.google.com/feeds/download/documents/export/Export?id=DOCUMENT_ID&exportFormat=EXPORT_FORMAT
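A minimal Python sketch of building that link, assuming the endpoint takes `id` and `exportFormat` as query parameters. The helper names `build_export_url` and `fetch_export` are made up for illustration, and the optional bearer token is an assumption about how a private document would be authorized:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

EXPORT_ENDPOINT = "https://docs.google.com/feeds/download/documents/export/Export"


def build_export_url(document_id, export_format):
    """Return the export link for one document and one format (e.g. "html")."""
    query = urlencode({"id": document_id, "exportFormat": export_format})
    return f"{EXPORT_ENDPOINT}?{query}"


def fetch_export(document_id, export_format, access_token=None):
    """Download the exported bytes.

    access_token is a hypothetical OAuth bearer token; whether it is needed
    depends on how the document is shared.
    """
    request = Request(build_export_url(document_id, export_format))
    if access_token:
        request.add_header("Authorization", f"Bearer {access_token}")
    with urlopen(request) as response:
        return response.read()
```

Calling `build_export_url("DOCUMENT_ID", "html")` only assembles the string; nothing is downloaded until `fetch_export` is called.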

This works fine. In fact, I export my doc to HTML and then read from it, but there is no way to tell where a page starts or ends.

Here are all the export formats I know of:

HTML, PDF, ODT, TXT, RTF and DOCX

PDF, ODT, RTF and DOCX all show separate pages when opened in a renderer. However, after searching through countless libraries for these formats (python-docx, PyPDF4, PyRTF, etc.), I have not been able to find a working way to read a Google document page by page.

Any suggestions?

Solution

You could use Apps Script. With it, you can take advantage of DocumentApp, which lets you find the PageBreak elements in a document.

You could then serve your tailored content as a web app.
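If you would rather stay in Python, the same idea of splitting at explicit page breaks can be sketched against the DOCX export. This is only a sketch and not the Apps Script approach itself. It assumes the export stores manual page breaks as `<w:br w:type="page"/>` in `word/document.xml`, and it will not see page boundaries that come from automatic pagination. The function name `split_docx_pages` is made up for illustration:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_docx_pages(docx_bytes):
    """Return one text string per page, split at explicit page breaks only."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    pages = [[]]  # each page is a list of paragraph strings
    for paragraph in root.iter(W + "p"):
        chunks = []
        for node in paragraph.iter():
            if node.tag == W + "t" and node.text:
                chunks.append(node.text)
            elif node.tag == W + "br" and node.get(W + "type") == "page":
                # Assumption: this element marks a manual page break.
                # Text collected so far closes the current page.
                pages[-1].append("".join(chunks))
                pages.append([])
                chunks = []
        pages[-1].append("".join(chunks))

    # Drop empty paragraphs and join the rest with newlines.
    return ["\n".join(p for p in page if p) for page in pages]
```

With the bytes of a DOCX export in `docx_bytes`, `split_docx_pages(docx_bytes)[1]` would give the text after the first manual page break, provided the document contains one.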