I'm trying to extract text from a doc/docx/pdf file in an AWS Lambda function written in Node.js. I need the text as an array of words. I've tried a few different npm packages, but the extraction calls seem to just be skipped.
AWS Lambda Function:
import { PDFExtract } from "pdf.js-extract";
...
export const handler = async (event) => {
  ...
  const pdfExtract = new PDFExtract();
  const tempFilePath = join(tmpdir(), "resume.pdf");
  const buffer = readFileSync(tempFilePath);
  const wordsList = [];
  await pdfExtract.extractBuffer(buffer, {}, (err, data) => {
    if (err)
      return console.log(err);
    data.pages[0].content.forEach((e) => {
      const str = e.str.trim().split(" ");
      str.forEach((word) => {
        if (word.length > 1)
          wordsList.push(word);
      });
    });
  });
  console.log(wordsList);
}
The same code works perfectly fine on my local machine, but when I deploy it to AWS Lambda, it fails to extract any text.
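One thing that may be worth ruling out in the snippet above: it passes a callback to `extractBuffer` and also `await`s the call. If the call does not return a Promise when a callback is supplied, the `await` finishes immediately and `console.log(wordsList)` can run before the callback has filled the list. This may not be the cause here, but a minimal sketch that wraps the callback in a Promise looks like this. `extractBufferAsync` and `collectWords` are hypothetical helper names, not part of pdf.js-extract, and the `pages[].content[].str` shape is taken from the code above. Unlike the original, the sketch walks every page and splits on any whitespace.

```javascript
// Hypothetical helper: wraps a callback-style extractBuffer call in a Promise
// so that `await` actually waits for the extraction to finish.
function extractBufferAsync(pdfExtract, buffer, options = {}) {
  return new Promise((resolve, reject) => {
    pdfExtract.extractBuffer(buffer, options, (err, data) => {
      if (err) reject(err);
      else resolve(data);
    });
  });
}

// Hypothetical helper: flattens all pages into a list of words,
// keeping the original rule of skipping one-character tokens.
function collectWords(data) {
  const words = [];
  for (const page of data.pages) {
    for (const item of page.content) {
      for (const word of item.str.trim().split(/\s+/)) {
        if (word.length > 1) words.push(word);
      }
    }
  }
  return words;
}
```

Inside the handler this would be used roughly as `const data = await extractBufferAsync(pdfExtract, buffer);` followed by `console.log(collectWords(data));`, so the log only runs once the extraction has completed or thrown.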
The Lambda function receives the file as a base64 string. Update: I was finally able to figure out what the problem was, and it was the function's timeout setting. Even so, I am now using Python instead of Node.js, and it works much faster. Thanks anyway.
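For anyone hitting the same thing, here is a rough sketch of the two points above: decoding the base64 payload into a Buffer and logging how much of the timeout is left. `decodeBase64File` and `logRemainingTime` are hypothetical helper names, and how the base64 string is read from the event is not shown here since the event shape is not given. `getRemainingTimeInMillis()` is the method the Lambda Node.js runtime exposes on the handler's `context` argument.

```javascript
// Decode a base64 string (as received by the Lambda function) into a Buffer
// that can be handed to a PDF extractor.
function decodeBase64File(base64) {
  return Buffer.from(base64, "base64");
}

// Log how much time is left before the configured timeout is reached.
// `context` is the second argument Lambda passes to the handler.
function logRemainingTime(context, label) {
  const ms = context.getRemainingTimeInMillis();
  console.log(`${label}: ${ms} ms left before timeout`);
  return ms;
}
```

Calling `logRemainingTime(context, "before extract")` and again after the extraction gives a quick way to see whether parsing is running up against the limit. The default Lambda timeout is 3 seconds, so a slow parse can be cut off before anything is logged unless the timeout is raised in the function configuration.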