node.js pdf google-cloud-storage langchain-js

Downloading PDF text into memory from Google Cloud Storage using LangChain in Node.js


I am trying to download a PDF file from a GCS bucket and read its content into memory.

When using LangChain with Python, I can just use the GCSDirectoryLoader to read all the files in a bucket, including the PDF text.

LangChain for Node.js doesn't have a GCSDirectoryLoader or a web loader for PDF files. When I download the file, I get a Document with the binary representation as its content.

What is the best way to download PDF content from a GCS bucket into memory?


Solution

  • I ended up doing the following for GCS bucket:

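    const { PDFExtract } = require('pdf.js-extract');
    const pdfExtract = new PDFExtract();

    documentBucket.getFiles()
    ...

    // download() resolves to a one-element array holding the file contents as a Buffer
    const [buffer] = await file.download();
    const options = { normalizeWhitespace: true };

    // Using 3rd-party lib => pdf.js-extract
    await pdfExtract.extractBuffer(buffer, options)
        .then((data) => {
          ...
        });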
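
    The listing and download steps above could be wrapped roughly like this. It is only a sketch: downloadPdfBuffers is a hypothetical helper name, the ".pdf" name filter is my own assumption, and the bucket argument is assumed to behave like a @google-cloud/storage Bucket (getFiles() resolving to [files], file.download() resolving to [buffer]).

    ```javascript
    // Minimal sketch: collect every PDF in a bucket as an in-memory Buffer.
    // `bucket` is assumed to behave like a @google-cloud/storage Bucket.
    async function downloadPdfBuffers(bucket) {
      // getFiles() resolves to an array whose first element is the file list
      const [files] = await bucket.getFiles();
      const results = [];
      for (const file of files) {
        // Assumption: PDFs are recognised by their object name ending in ".pdf"
        if (!file.name.toLowerCase().endsWith('.pdf')) continue;
        // download() without a destination keeps the contents in memory
        const [buffer] = await file.download();
        results.push({ name: file.name, buffer });
      }
      return results;
    }
    ```

    Each returned buffer can then be handed to pdfExtract.extractBuffer(buffer, options) exactly as in the snippet above.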
    

    And for Google Drive:

    const { google } = require('googleapis');

    const drive = google.drive({
        version: 'v3',
        auth,
    });

    // responseType: 'arraybuffer' makes resp.data an ArrayBuffer with the raw file bytes
    const resp = await drive.files.get({ fileId: file.id, alt: 'media' }, { responseType: 'arraybuffer' });
    const buffer = Buffer.from(resp.data);
    const options = { normalizeWhitespace: true };

    // Using 3rd-party lib => pdf.js-extract
    await pdfExtract.extractBuffer(buffer, options)
        .then((data) => {
          ...
        });
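
    Once extractBuffer resolves, the result still has to be turned into plain text. A rough sketch of that step follows, assuming the result has the shape pdf.js-extract documents (a pages array whose content items each carry a str field). pagesToText is a hypothetical helper name, and joining items with single spaces is a simplification that ignores the x/y layout information.

    ```javascript
    // Minimal sketch: flatten a pdf.js-extract style result into one string per page.
    // Assumes data.pages[i].content is an array of items with a `str` field.
    function pagesToText(data) {
      return data.pages.map((page) =>
        page.content
          .map((item) => item.str)
          // Drop items that contain only whitespace
          .filter((str) => str.trim().length > 0)
          .join(' ')
      );
    }
    ```

    Inside the .then((data) => { ... }) callback this would be called as pagesToText(data), giving an array with one text string per page.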
    

    The documentation for the API could be clearer; what I ended up needing was to set the responseType to 'arraybuffer', which I couldn't find in the docs.
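
    To show why that response type matters: with it, resp.data arrives as an ArrayBuffer, and Buffer.from turns that into the Node.js Buffer that extractBuffer expects. The sketch below uses a hand-built ArrayBuffer as a stand-in for resp.data (an assumption for illustration, not a real Drive response).

    ```javascript
    // Stand-in for resp.data: an ArrayBuffer holding the five bytes "%PDF-",
    // which is how a PDF file starts.
    const arrayBuffer = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d]).buffer;

    // Buffer.from(arrayBuffer) creates a Buffer over the same memory, without copying.
    // It is a static factory, so no `new` is needed.
    const buffer = Buffer.from(arrayBuffer);

    console.log(Buffer.isBuffer(buffer));         // true
    console.log(buffer.toString('latin1', 0, 5)); // %PDF-
    ```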

    I'm still going to put some time into determining whether the 3rd-party lib is really needed, but that has a lower priority for me at the moment.