Tags: node.js, json, amazon-web-services, aws-lambda, js-xlsx

Convert an XLSX to JSON using Node.js and the XLSX package


Here is my full scenario: I have a static site hosted on S3. It's just an interface for uploading a file. The upload goes through an API Gateway endpoint, which triggers a Lambda function.

This Lambda function reads the XLSX file, processes it and uploads the JSON to an S3 bucket. The PUT event on the bucket triggers another Lambda function, which reads this JSON and sends its contents as messages to an SQS queue. Then, as soon as a message arrives, it spins up an EC2 instance for processing.

The problem is: I don't know what is going on, but my output is getting totally scrambled. It looks as if nothing is decoding the base64 after it is received.

Here is an example of my output:

(Screenshot: output example)

The API Gateway has been set up correctly, I'm pretty sure about that:

  1. I've set the binary media type as application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  2. Added body mapping templates
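
For reference, the binary media type can be registered on the REST API with the AWS CLI, roughly like this (a sketch only; the REST API id is a placeholder, and the ~1 escapes the / in the media type):

aws apigateway update-rest-api \
    --rest-api-id <rest-api-id> \
    --patch-operations op='add',path='/binaryMediaTypes/application~1vnd.openxmlformats-officedocument.spreadsheetml.sheet'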

When testing by sending a file through Postman, it works without problems. Based on this I suppose the API has been set up fine, and so has the function; it's something happening between the browser and the Lambda function / API call.

This is what my log on the lambda shows:

2024-09-03T22:13:31.337Z 2024-09-03T22:13:31.337Z a107932e-0bde-4b4c-ad8d-478625019503 INFO Processed data: [ [ { "numCpfCnpj": "Content-Disposition: form-data", "numAcordo": " name="file"" }

], null, ]....lots of binary data

This part in particular:

"numCpfCnpj": "Content-Disposition: form-data",
"numAcordo": " name=\"file\"" }

got my attention, because it looks like something is wrong with the decoding: those headers show up where there should be data.

This is the code I'm using to read the XLSX file:

import XLSX from 'xlsx';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3Client = new S3Client({ region: process.env.AWS_REGION });

const processXlsxFile = (fileContent) => {
    const workbook = XLSX.read(fileContent, { type: 'buffer' });
    const sheet = workbook.Sheets[workbook.SheetNames[0]];

    // Convert all rows to JSON, treating every cell as text to preserve leading zeros
    const data = XLSX.utils.sheet_to_json(sheet, {
        header: 1, // Return each row as an array of cell values
        raw: false // Ensure that all cells are treated as strings
    }).slice(1); // Skip the header row

    // Process the data and pad CPF/CNPJ numbers as needed
    return data.reduce((acc, row, index) => {
        // Skip empty or invalid rows
        if (!row[0] || !row[1]) return acc;

        // Determine if the number is CPF (11 digits) or CNPJ (14 digits)
        const numCpfCnpj = String(row[0]).length <= 11 
            ? String(row[0]).padStart(11, '0')  // Pad CPF to 11 digits
            : String(row[0]).padStart(14, '0'); // Pad CNPJ to 14 digits

        const groupIndex = Math.floor(index / 2); // group every two rows together
        if (!acc[groupIndex]) acc[groupIndex] = [];
        acc[groupIndex].push({
            numCpfCnpj,
            numAcordo: String(row[1])
        });
        return acc;
    }, []); 
};

export const handler = async (event) => {
    try {
        console.log("Received event:", JSON.stringify(event, null, 2));
        
        if (event.body) {
            // Decode the base64-encoded file content
            const fileContent = Buffer.from(event.body, 'base64');
            console.log("File content received");

            // Process the Excel file content
            const processedData = processXlsxFile(fileContent);
            console.log("Processed data:", JSON.stringify(processedData, null, 2));

            // Flatten the processed data to a single array
            const flatData = processedData.flat();
            console.log("Flattened data:", JSON.stringify(flatData, null, 2));

            // Convert the result to a JSON string
            const jsonString = JSON.stringify(flatData);
            const jsonFileName = `converted_data_${Date.now()}.json`;
            const bucketName = process.env.BUCKET_NAME;

            // Upload the JSON to S3
            const putObjectParams = {
                Bucket: bucketName,
                Key: jsonFileName,
                Body: jsonString,
                ContentType: 'application/json'
            };

            const command = new PutObjectCommand(putObjectParams);
            await s3Client.send(command);

            console.log(`JSON file uploaded successfully to S3: ${jsonFileName}`);

            return {
                statusCode: 200,
                body: JSON.stringify({ message: 'File processed and JSON uploaded to S3 successfully.' }),
                headers: {
                    'Content-Type': 'application/json'
                }
            };
        } else {
            return {
                statusCode: 400,
                body: JSON.stringify({ message: 'No file uploaded' }),
                headers: {
                    'Content-Type': 'application/json'
                }
            };
        }
    } catch (error) {
        console.error("Error processing file:", error);
        return {
            statusCode: 500,
            body: JSON.stringify({
                message: 'Internal Server Error',
                error: error.message
            }),
            headers: {
                'Content-Type': 'application/json'
            }
        };
    }
};

Any input is appreciated.

Also, if anyone has any questions, I'll be glad to reply as well.

Edit: HTML Code

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>File Processor</title>
    <link rel="stylesheet" href="styles/styles.css"> <!-- Link to external CSS file -->
</head>
<body>
    <div class="container">
        <h1>Process XLSX File</h1>
        <input type="text" id="txtInput" readonly placeholder="No file chosen">
        <input type="file" id="fileInput" style="display:none" accept=".xlsx">
        <button id="btnBrowse">Procurar arquivo</button>
        <button id="btnProcess">Processar</button>
        <div id="results"></div>
    </div>
    <script src="scripts/main.js"></script> <!-- Link to external JavaScript file -->
</body>
</html>

Js Code:

document.addEventListener('DOMContentLoaded', () => {
    const fileInput = document.getElementById('fileInput');
    const txtInput = document.getElementById('txtInput');
    const btnBrowse = document.getElementById('btnBrowse');
    const btnProcess = document.getElementById('btnProcess');
    const resultsDiv = document.getElementById('results');

    btnBrowse.addEventListener('click', () => {
        fileInput.click();
    });

    fileInput.addEventListener('change', () => {
        txtInput.value = fileInput.files[0] ? fileInput.files[0].name : 'No file chosen';
    });

    btnProcess.addEventListener('click', async () => {
        if (!fileInput.files.length) {
            alert('Please select a file first.');
            return;
        }

        const formData = new FormData();
        formData.append('file', fileInput.files[0]);

        try {
            const response = await fetch('https://5lvbhojaaf.execute-api.sa-east-1.amazonaws.com/conversion/readFile', {
                method: 'POST',
                body: formData
            });

            if (!response.ok) {
                throw new Error(`HTTP error! status: ${response.status}`);
            }

            const result = await response.json();

            // Format JSON output for better readability
            resultsDiv.innerHTML = `<pre>${JSON.stringify(result, null, 2)}</pre>`;
        } catch (error) {
            console.error('Error processing file:', error);
            resultsDiv.innerHTML = '<p>Error processing file. Check the console for details.</p>';
        }
    });
});

Postman endpoint: https://5lvbhojaaf.execute-api.sa-east-1.amazonaws.com/conversion/readFile

The weird characters appear after the function execution. In the CloudWatch logs I can see the data being received as base64, and after the function execution it becomes scrambled.


Solution

  • Summary

    • Your client (html/js) is not encoding the file to base64 before the upload
    • Your lambda is expecting a base64 body

    Advice

    • Try it locally and debug (or print) step by step to check whether your file is being sent the way the backend expects
    • Fixing this directly on AWS will be a pain. Check this to try it locally before deploying to AWS:
    • If you want to avoid multipart/form-data, convert the file to base64 at the client layer (html/js) and then send the content as JSON, so it will be easy to get the content in the lambda handler (see the sketch after this list)
      • Warning: if your files tend to grow, converting them to base64 can become a problem (RAM, bandwidth, browser limits, etc.). That's why the multipart/form-data content type exists. Check this and this
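
    A minimal sketch of that approach (the fileBase64 field name and reusing your existing btnProcess / fileInput elements are assumptions for illustration):

        // Browser side: read the file as base64 and POST it as JSON
        const toBase64 = (file) => new Promise((resolve, reject) => {
            const reader = new FileReader();
            reader.onload = () => resolve(reader.result.split(',')[1]); // strip the "data:...;base64," prefix
            reader.onerror = reject;
            reader.readAsDataURL(file);
        });

        btnProcess.addEventListener('click', async () => {
            const fileBase64 = await toBase64(fileInput.files[0]);
            await fetch('https://5lvbhojaaf.execute-api.sa-east-1.amazonaws.com/conversion/readFile', {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify({ fileBase64 })
            });
        });

        // Lambda side: the body is now plain JSON, so decoding is straightforward
        // const { fileBase64 } = JSON.parse(event.body);
        // const fileContent = Buffer.from(fileBase64, 'base64');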

    #1 Base64

    At the client (html/js) you are not encoding the file as base64, so at the lambda layer, this doesn't make sense:

    const fileContent = Buffer.from(event.body, 'base64');

    Maybe when using Postman you are sending the file as base64, but in your html/js you are not converting the file to base64; you are sending it as Content-Type: multipart/form-data
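
    As a quick check, you can log what the lambda actually receives (these field names follow the API Gateway proxy event format; if you use a custom mapping template they may differ):

        console.log('Content-Type:', event.headers && (event.headers['content-type'] || event.headers['Content-Type']));
        console.log('isBase64Encoded:', event.isBase64Encoded);
        console.log('Body starts with:', typeof event.body === 'string' ? event.body.slice(0, 80) : typeof event.body);

    If the body starts with something like ------WebKitFormBoundary..., the browser is sending multipart/form-data, not base64.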

    #2 Multipart Form Data

    (1) Getting the content from multipart/form-data is not easy compared to when the content type is Content-Type: application/json

    (2) Usually both the file name and the file content (binary) are sent

    (3) Usually this content type has sections for the size, disposition, content, etc.

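    For illustration, the raw body of a browser upload looks roughly like this (the boundary value and file name below are made up):

        POST /conversion/readFile HTTP/1.1
        Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryAbc123

        ------WebKitFormBoundaryAbc123
        Content-Disposition: form-data; name="file"; filename="data.xlsx"
        Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

        <binary xlsx bytes>
        ------WebKitFormBoundaryAbc123--

    Those Content-Disposition: form-data lines are exactly what is leaking into your numCpfCnpj field: XLSX.read is being fed the whole multipart envelope instead of just the file bytes.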

    Almost nobody deals with this directly (in Node.js); that's why there are libraries in every language that help the developer. For example, the multer library makes our lives easy:

    const express = require('express');
    const multer = require('multer');
    const app = express();
    const upload = multer(); // no options: uploaded files are kept in memory

    app.post('/upload', upload.single('file'), function(req, res) {
      const title = req.body.title;
      const file = req.file; // file is ready to use (pdf, xls, zip, images, etc)
      res.sendStatus(200);
    });


    #3 Multipart Form Data with AWS Lambda

    event.body doesn't have the file ready to use. That is another of your errors.

    A quick search didn't turn up any "easy" ways to deal with multipart/form-data using AWS Lambdas.

    Reviewing some libraries, I found the magic that gets a file out of the lambda event variable:

    // Note: getBoundary is defined elsewhere in the library; roughly, it pulls the
    // boundary string out of the request's Content-Type header, e.g.:
    const getBoundary = (event) => event.headers['Content-Type'].split('boundary=')[1];

    module.exports.parse = (event, spotText) => {
        const boundary = getBoundary(event);
        const result = {};
        event.body
            .split(boundary)
            .forEach(item => {
                if (/filename=".+"/g.test(item)) {
                    result[item.match(/name=".+";/g)[0].slice(6, -2)] = {
                        type: 'file',
                        filename: item.match(/filename=".+"/g)[0].slice(10, -1),
                        contentType: item.match(/Content-Type:\s.+/g)[0].slice(14),
                        content: spotText? Buffer.from(item.slice(item.search(/Content-Type:\s.+/g) + item.match(/Content-Type:\s.+/g)[0].length + 4, -4), 'binary'):
                            item.slice(item.search(/Content-Type:\s.+/g) + item.match(/Content-Type:\s.+/g)[0].length + 4, -4),
                    };
                } else if (/name=".+"/g.test(item)){
                    result[item.match(/name=".+"/g)[0].slice(6, -1)] = item.slice(item.search(/name=".+"/g) + item.match(/name=".+"/g)[0].length + 4, -4);
                }
            });
        return result;
    };
    

    This is already implemented in these libraries:

    Choose one and try it. I'm sure that the file will be ready to be used in your

    const workbook = XLSX.read(file, { type: 'buffer' });
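
    For example, a sketch with the lambda-multipart-parser package (field names follow its README; double-check them against the version you install):

        import parser from 'lambda-multipart-parser';
        import XLSX from 'xlsx';

        export const handler = async (event) => {
            // parse() reads the boundary from the headers and decodes each part;
            // files[i].content is a Buffer with the raw file bytes
            const { files } = await parser.parse(event);
            const workbook = XLSX.read(files[0].content, { type: 'buffer' });
            // ...continue with sheet_to_json / processXlsxFile as before
        };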