Tags: javascript, parsing, text, lambda, extract

Extract text from pdf/doc/docx file using AWS Lambda (Node.js)


I'm trying to extract text from a doc/docx/pdf file in an AWS Lambda function written in Node.js. I need the text as an array of words. I've tried a few different npm packages, but the Lambda seems to just skip the extraction calls.

AWS Lambda Function:

import { readFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";
import { PDFExtract } from "pdf.js-extract";

...

export const handler = async (event) => {

    ...

    const pdfExtract = new PDFExtract();
    const tempFilePath = join(tmpdir(), "resume.pdf");
    const buffer = readFileSync(tempFilePath);
    const wordsList = [];

    await pdfExtract.extractBuffer(buffer, {}, (err, data) => {
        if (err)
            return console.log(err);
        data.pages[0].content.forEach((e) => {
            const str = e.str.trim().split(" ");
            str.forEach((word) => {
                if (word.length > 1)
                    wordsList.push(word);
            });
        });
    });
    console.log(wordsList);
}

[Image: files structure]

The same code works perfectly fine on my local machine, but when I deploy it to AWS Lambda, it fails to extract any text.


Solution

  • The Lambda function receives the file as a base64 string. I finally figured out that the problem was the timeout setting. That said, I'm now using Python instead of Node.js, and it works much faster. Thanks anyway.
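
For anyone who hits the same symptom and wants to stay on Node.js, here is a minimal sketch of decoding the base64 payload and awaiting the extraction before the handler returns. It is not the code I ended up with, just an illustration. The helper names (decodeBase64File, toWords, extractWords) are hypothetical, and `extract` is an injected function assumed to take a Buffer and return a Promise of the same `{ pages: [{ content: [{ str }] }] }` shape used in the question. Unlike the question's code, it walks every page rather than only the first.

```javascript
// Sketch only: `extract` is a hypothetical injected function that takes a
// Buffer and returns a Promise of { pages: [{ content: [{ str }] }] },
// the shape used in the question's code.

// Decode the base64 string the Lambda receives into a Buffer.
function decodeBase64File(base64) {
    return Buffer.from(base64, "base64");
}

// Split every text item on every page into words longer than one character.
function toWords(data) {
    const words = [];
    for (const page of data.pages) {
        for (const item of page.content) {
            for (const word of item.str.trim().split(/\s+/)) {
                if (word.length > 1) words.push(word);
            }
        }
    }
    return words;
}

// Await the extraction so the caller only gets a result once the words
// have actually been collected.
async function extractWords(base64, extract) {
    const buffer = decodeBase64File(base64);
    const data = await extract(buffer);
    return toWords(data);
}
```

With pdf.js-extract this might be called from the handler as `await extractWords(event.body, (buf) => new PDFExtract().extractBuffer(buf, {}))`. That assumes the payload arrives in `event.body` and that the library returns a promise when no callback is passed, so check both against your own setup. The function timeout still has to be long enough for the extraction to finish.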