node.js pdf google-cloud-storage langchain-js

Downloading PDF text into memory from Google Cloud Storage using LangChain Node.js

I am trying to download a PDF file from a GCS bucket and read its content into memory.

When using LangChain with Python, I can just use the GCSDirectoryLoader to read all the files in a bucket along with the PDF text.

LangChain for Node.js doesn't have a GCSDirectoryLoader or a web loader for PDF files. When downloading the file, I get a Document with the binary representation as its content.

What is the best way to download PDF content from a GCS bucket into memory?

Solution

I ended up doing the following for the GCS bucket:

documentBucket.getFiles()
...

const [buffer] = await file.download();
const options = { normalizeWhitespace: true };

// Using 3rd-party lib => pdf.js-extract
await pdfExtract.extractBuffer(buffer, options)
    .then((data) => {
      ...
    });
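
A minimal sketch of how the GCS steps above could be put together. The function name loadPdfTextsFromBucket, its prefix parameter, and the bucket name in the usage comment are hypothetical. The bucket (from @google-cloud/storage) and the PDFExtract instance (from pdf.js-extract) are passed in as arguments, and joining the str fields of each page is just one possible way to turn the extracted items into plain text.

```javascript
// Sketch: read every PDF in a GCS bucket into memory and return its text.
// Assumptions: `bucket` is a @google-cloud/storage Bucket and `pdfExtract`
// is a PDFExtract instance from pdf.js-extract. Names are illustrative.
async function loadPdfTextsFromBucket(bucket, pdfExtract, prefix = '') {
  const options = { normalizeWhitespace: true };

  // getFiles() resolves to an array whose first element is the file list.
  const [files] = await bucket.getFiles(prefix ? { prefix } : {});
  const results = [];

  for (const file of files) {
    // Skip anything that does not look like a PDF.
    if (!file.name.toLowerCase().endsWith('.pdf')) continue;

    // download() without a destination resolves to [Buffer]; nothing is written to disk.
    const [buffer] = await file.download();
    const data = await pdfExtract.extractBuffer(buffer, options);

    // Each page has a `content` array of text items; join their `str` fields.
    const text = data.pages
      .map((page) => page.content.map((item) => item.str).join(' '))
      .join('\n');

    results.push({ name: file.name, text });
  }

  return results;
}

// Possible usage (bucket name is a placeholder):
//   const { Storage } = require('@google-cloud/storage');
//   const { PDFExtract } = require('pdf.js-extract');
//   const docs = await loadPdfTextsFromBucket(
//     new Storage().bucket('my-bucket'),
//     new PDFExtract(),
//   );
```

Passing the bucket and extractor in as arguments keeps the function easy to try out with stubs before pointing it at a real bucket.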

And for Google Drive:

const drive = google.drive({
  version: 'v3',
  auth,
});

const resp = await drive.files.get({ fileId: file.id, alt: "media" }, {responseType: 'arraybuffer'});
const buffer = Buffer.from(resp.data);
const options = { normalizeWhitespace: true };

// Using 3rd-party lib => pdf.js-extract
await pdfExtract.extractBuffer(buffer, options)
    .then((data) => {
      ...
    });

The documentation for the API could be clearer. What I ended up needing was setting responseType to 'arraybuffer', which I couldn't find in the docs.
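
To make the responseType point concrete, here is a small sketch of the Drive download step wrapped in a helper. The name downloadDriveFileToBuffer is hypothetical, and the drive client is assumed to be the object returned by google.drive(...) as in the snippet above. With responseType set to 'arraybuffer', resp.data should arrive as an ArrayBuffer, which Buffer.from can wrap directly.

```javascript
// Sketch: fetch a Drive file's raw bytes into a Node.js Buffer.
// Assumption: `drive` is the client returned by google.drive({ version: 'v3', auth }).
async function downloadDriveFileToBuffer(drive, fileId) {
  const resp = await drive.files.get(
    { fileId, alt: 'media' },          // alt=media requests the file content, not its metadata
    { responseType: 'arraybuffer' },   // request options: ask for the body as binary data
  );

  // resp.data is expected to be an ArrayBuffer here; wrap it in a Buffer
  // so it can be handed to pdfExtract.extractBuffer().
  return Buffer.from(resp.data);
}
```

The resulting Buffer can then go through the same extractBuffer call that is used for the GCS download.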

I'm still going to put some time into determining whether the 3rd-party lib is really needed, but that has a lower priority for me at the moment.