google-api html-parsing google-docs python-docx pypdf

Extracting text from a Google document and getting a particular page


As of now, I export my Google documents by getting the content from this link:

https://docs.google.com/feeds/download/documents/export/Export?id=DOCUMENT_ID&exportFormat=EXPORT_FORMAT

This works fine: I export the document as HTML and read from that. However, the HTML gives no indication of where a page starts or ends.
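
For reference, a minimal sketch of how the export URL above might be assembled from a document ID and a format. The helper name `buildExportUrl` is hypothetical, and the sketch only builds the string; fetching a private document would also require an authorized request.

```javascript
// Hypothetical helper: fills in the export URL pattern quoted in the question.
// documentId and exportFormat correspond to DOCUMENT_ID and EXPORT_FORMAT.
function buildExportUrl(documentId, exportFormat) {
  var base = 'https://docs.google.com/feeds/download/documents/export/Export';
  // encodeURIComponent keeps unusual characters from breaking the query string.
  return base +
      '?id=' + encodeURIComponent(documentId) +
      '&exportFormat=' + encodeURIComponent(exportFormat);
}
```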

Here are all the export formats I know of:

HTML, PDF, ODT, TXT, RTF and DOCX

PDF, ODT, RTF and DOCX all show separate pages when opened in a renderer. However, after trying many libraries for these formats (python-docx, PyPDF4, PyRTF, etc.), I have not found a working way to read a Google document page by page.

Any suggestions?


Solution

  • You could use Apps Script. Its DocumentApp service lets you locate the PageBreak elements in a document.

    You could then serve the resulting per-page content as a web app.
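
A rough sketch of that approach in Apps Script, under some assumptions: it splits only on explicit page breaks (DocumentApp exposes inserted PageBreak elements, not the automatic pagination a renderer computes), it reads only paragraphs and list items, and the names `splitIntoPages`, `id` and `page` are hypothetical choices for illustration.

```javascript
// DocumentApp and ContentService are Apps Script globals.
// splitIntoPages and the "id" / "page" query parameters are hypothetical names.

/**
 * Splits a document body into chunks of text separated by explicit page breaks.
 * Only paragraphs and list items are read; tables and other elements are skipped.
 */
function splitIntoPages(body) {
  var types = DocumentApp.ElementType;
  var pages = [''];
  for (var i = 0; i < body.getNumChildren(); i++) {
    var element = body.getChild(i);
    var type = element.getType();
    var container;
    if (type === types.PARAGRAPH) {
      container = element.asParagraph();
    } else if (type === types.LIST_ITEM) {
      container = element.asListItem();
    } else {
      continue; // tables and other element types are ignored in this sketch
    }
    // A page break is a child of a paragraph, so walk the paragraph's children.
    for (var j = 0; j < container.getNumChildren(); j++) {
      var child = container.getChild(j);
      var childType = child.getType();
      if (childType === types.PAGE_BREAK) {
        pages.push(''); // start collecting text for the next page
      } else if (childType === types.TEXT) {
        pages[pages.length - 1] += child.asText().getText();
      }
    }
    pages[pages.length - 1] += '\n';
  }
  return pages;
}

/** Web app entry point: GET ?id=DOCUMENT_ID&page=N returns page N (1-based) as JSON. */
function doGet(e) {
  var body = DocumentApp.openById(e.parameter.id).getBody();
  var pages = splitIntoPages(body);
  var index = Number(e.parameter.page || 1) - 1;
  var payload = { pageCount: pages.length, text: pages[index] || '' };
  return ContentService.createTextOutput(JSON.stringify(payload))
      .setMimeType(ContentService.MimeType.JSON);
}
```

Once deployed as a web app, a client (for example a Python script) could request a single page by passing the document ID and page number as query parameters. If the document relies on automatic pagination rather than inserted page breaks, this sketch would return the whole text as one page.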