javascript google-docs

I want to extract the text from a Google Doc using an extension in the browser and preserve semantic line breaks


I have a browser extension (Firefox and Chrome) that works much like a spell checker. It mostly works fine when getting text values from input, textarea, and even most contenteditable elements. However, Google Docs inserts \n for visual reasons, which makes it hard to recover semantic paragraphs and sentences.

e.g. the text:

A Long Heading That Visually Wraps With No Period On The End
 
A sentence that runs long enough that it visually wraps in Google Docs and ends up with extra line breaks. Another shorter sentence.

when extracted from the Google Docs DOM and run through JSON.stringify shows up thus:

"\"A Long Heading That Visually Wraps \\nWith No Period On The End \\n  \\nA sentence that runs long enough that it visually wraps in Google Docs and ends up with extra \\nline breaks. Another shorter sentence.\""

Note the \\n before With, which is not semantic; then the \\n \\n after the headline, which is semantic; and then the \\n before line, which again is not semantic.

In this specific case I can text.replace(/\n \n/g, '!!!').replace(/\n/g, '').replace(/!!!/g, '\n\n') to get a (more) semantic body of text back.

However, if there is no double \n after the heading, this doesn't work.

You can see how fragile it can be.
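
For reference, the replace chain above can be written without the '!!!' placeholder. This is only a sketch of the same heuristic, not a fix for it: the function name normalizeLineBreaks is made up for illustration, and it assumes that a line containing only whitespace marks a real paragraph break while every other \n is visual wrapping.

```javascript
// Hypothetical helper (not part of any library): treat whitespace-only lines
// as paragraph breaks and fold every other line break into a single space.
function normalizeLineBreaks(text) {
  return text
    .split(/\n[ \t\u00a0]*\n/)                  // blank (or space-only) line = semantic break
    .map((block) => block.replace(/[ \t]*\n[ \t]*/g, ' ').trim()) // visual wraps become spaces
    .filter((block) => block.length > 0)        // drop empty blocks from repeated blank lines
    .join('\n\n');
}

// The sample text from above, before JSON.stringify.
const raw =
  'A Long Heading That Visually Wraps \nWith No Period On The End \n  \n' +
  'A sentence that runs long enough that it visually wraps in Google Docs ' +
  'and ends up with extra \nline breaks. Another shorter sentence.';

console.log(JSON.stringify(normalizeLineBreaks(raw)));
```

It has the same weakness: with no blank line after the heading, the heading and the first sentence are merged into one paragraph.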

Is there a JavaScript DOM/API for a Google Doc that doesn't require extra authorisation so that I can get the clean text of a document? The user has already installed this extension and having to also authorise an app for their Google Drive is not viable.

Alternatively, is there a JavaScript sentence tokenizer? Otherwise I'm going to have to ship the raw text off to a Python API endpoint that uses the NLTK or spaCy sentence tokenizer.


Solution

  • Depending on whether the document you want to extract data from is public, your application may or may not require authorization to extract the clean text.

    Either way, the Document service (DocumentApp) of Apps Script and the Google Docs API are both good options for obtaining the clean body text. They also go beyond plain text extraction: you can select titles, subtitles, and so on.

    NOTE: If you try to access a document that is not public, you will need to use OAuth 2.0. Since it isn't a public resource, you must use the credentials of an account that has access to it.
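
As a rough sketch of the Docs API route: a documents.get response is JSON whose body.content is a list of structural elements, and each paragraph carries its text in elements[].textRun.content, so visual wrapping never appears in it. The helper name paragraphsFromDoc and the sample object below are hypothetical; the sample is hand-written to mimic the response shape and is not real API output. Fetching the response (and any authorization it needs) is left out.

```javascript
// Hypothetical helper: one entry per non-empty paragraph of a Docs API document.
function paragraphsFromDoc(doc) {
  const content = (doc.body && doc.body.content) || [];
  return content
    .filter((el) => el.paragraph)               // skip section breaks, tables, etc.
    .map((el) => {
      const style = el.paragraph.paragraphStyle || {};
      const text = (el.paragraph.elements || [])
        .map((e) => (e.textRun ? e.textRun.content : ''))
        .join('')
        .replace(/\n$/, '');                    // a paragraph's text ends with a newline
      return { style: style.namedStyleType || 'NORMAL_TEXT', text };
    })
    .filter((p) => p.text.trim() !== '');       // drop empty paragraphs
}

// Hand-written sample in the assumed response shape (illustration only).
const sampleDoc = {
  body: {
    content: [
      { sectionBreak: {} },
      {
        paragraph: {
          paragraphStyle: { namedStyleType: 'HEADING_1' },
          elements: [
            { textRun: { content: 'A Long Heading That Visually Wraps ' } },
            { textRun: { content: 'With No Period On The End\n' } },
          ],
        },
      },
      { paragraph: { elements: [{ textRun: { content: '\n' } }] } },
      {
        paragraph: {
          paragraphStyle: { namedStyleType: 'NORMAL_TEXT' },
          elements: [{ textRun: { content: 'A sentence. Another shorter sentence.\n' } }],
        },
      },
    ],
  },
};

console.log(paragraphsFromDoc(sampleDoc));
```

Because the heading arrives as its own paragraph with its own style, no blank line or trailing period is needed to separate it from the body text.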