I am trying to download a PDF file from a GCS storage bucket and read its content into memory.
When using LangChain with Python, I can just use the GCSDirectoryLoader to read all the files in a bucket and extract the PDF text.
LangChain for Node.js doesn't have a GCSDirectoryLoader or a web loader for PDF files. When downloading a file, I get a Document with the binary representation as its content.
What is the best way to download PDF content from a GCS bucket into memory?
I ended up doing the following for GCS bucket:
const [files] = await documentBucket.getFiles();
...
const [buffer] = await file.download();
const options = { normalizeWhitespace: true };
// Using 3rd-party lib => pdf.js-extract
const data = await pdfExtract.extractBuffer(buffer, options);
...
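Putting the pieces together, here is a fuller sketch of the GCS path. Treat it as an untested outline: the function name, the `.pdf` filename filter, and the way pages are joined into a single string are my own assumptions, and it expects `@google-cloud/storage` and `pdf.js-extract` to be installed.

```javascript
// Sketch: read every PDF in a GCS bucket into memory and extract its text.
// Nothing is written to disk - file.download() resolves to a Buffer.
async function extractPdfTextFromBucket(bucketName) {
  const { Storage } = require('@google-cloud/storage');
  const { PDFExtract } = require('pdf.js-extract');
  const storage = new Storage();
  const pdfExtract = new PDFExtract();

  const [files] = await storage.bucket(bucketName).getFiles();
  const results = [];
  for (const file of files.filter((f) => f.name.endsWith('.pdf'))) {
    const [buffer] = await file.download(); // whole file into memory
    const data = await pdfExtract.extractBuffer(buffer, {
      normalizeWhitespace: true,
    });
    // data.pages[n].content is an array of positioned text items;
    // joining their `str` values is one simple way to flatten a page.
    const text = data.pages
      .map((page) => page.content.map((item) => item.str).join(' '))
      .join('\n');
    results.push({ name: file.name, text });
  }
  return results;
}
```

Credentials are picked up from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS), as is usual for the Cloud client libraries.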
And for Google drive:
const drive = google.drive({
version: 'v3',
auth,
})
const resp = await drive.files.get({ fileId: file.id, alt: "media" }, {responseType: 'arraybuffer'});
const buffer = Buffer.from(resp.data);
const options = { normalizeWhitespace: true };
// Using 3rd-party lib => pdf.js-extract
const data = await pdfExtract.extractBuffer(buffer, options);
...
The documentation for the API could be clearer; what I ended up needing was setting the responseType to 'arraybuffer', which I couldn't find in the docs.
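One small gotcha worth noting: Buffer.from is a static factory and should be called without the `new` keyword. A minimal, self-contained illustration of converting an ArrayBuffer (the shape that responseType: 'arraybuffer' gives you) into a Node Buffer, using fabricated bytes rather than a real Drive response:

```javascript
// Simulate resp.data: with responseType 'arraybuffer' you get an ArrayBuffer.
// These four bytes are the PDF magic number, "%PDF".
const arrayBuf = new Uint8Array([0x25, 0x50, 0x44, 0x46]).buffer;

// Buffer.from accepts an ArrayBuffer directly - no `new` needed.
const buffer = Buffer.from(arrayBuf);
console.log(buffer.toString('utf8')); // prints "%PDF"
```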
I'm still going to put some time into determining whether the third-party lib is really needed, but that has a lower priority for me at the moment.