node.js  amazon-s3  aws-lambda

AWS Lambda writing to S3 from ZIP using stream is sometimes "delayed" and throws `NoSuchKey` error


I'm working on a project that receives a ZIP archive uploaded to S3; the upload triggers an SQS event, which in turn triggers a Lambda (Node 18). The Lambda reads the ZIP archive and extracts it (using the unzipper module) as a stream:

const zipResponse = await S3.send(new GetObjectCommand(params));
const zip = zipResponse.Body.pipe(unzipper.Parse({ forceStream: true })); // zipResponse.Body itself is a readable stream
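
For completeness, the snippets in this post assume roughly this setup (abbreviated; the client and module names match what I use further down):

const stream = require('node:stream');
const unzipper = require('unzipper');
const { S3Client, GetObjectCommand, PutObjectCommand } = require('@aws-sdk/client-s3');

const S3 = new S3Client({}); // region/credentials come from the Lambda environment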

I then use a for await loop to iterate over each "file" in the archive; inside the loop I gather some metadata and call the upload function:

for await (const entry of zip) {
  // Gather some metadata here... removed...
  entry.pipe(uploadFromStream(metadata));
}
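
The metadata gathering is stripped out above. For reference, the parts that matter for the upload come from the fields unzipper exposes on each entry, roughly like this (outputBucket and registrationId are placeholders for values I take from the SQS message):

const metadata = {
  bucket: outputBucket,                    // placeholder: target bucket
  reg: registrationId,                     // placeholder: identifier from the message
  correlationId: crypto.randomUUID(),      // global crypto is available in Node 18
  extension: entry.path.split('.').pop(),  // entry.path comes from unzipper
  size: entry.vars.uncompressedSize        // uncompressed size, used for ContentLength
};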

uploadFromStream() uses a PassThrough stream to send each entry to S3:

function uploadXMLToS3(s3XMLParams) {
  return S3.send(new PutObjectCommand(s3XMLParams));
}

function uploadFromStream(metadata) {
  const pass = new stream.PassThrough();
  const ContentType = getContentType(metadata.extension);
  const Key = `${metadata.reg}/${metadata.correlationId}.${metadata.extension}`;

  const params = {
    Bucket: metadata.bucket,
    Key,
    ContentType,
    Body: pass,
    ContentLength: metadata.size // required when passing a stream as the Body, see https://stackoverflow.com/a/76673581/10187742
  };

  // ==> Original code: S3.send(new PutObjectCommand(params));
  uploadXMLToS3(params);

  return pass;
}

After this I continue processing the files in a new loop, reading the individual files back from S3.
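
That post-processing read is essentially just a GetObjectCommand per file, something like this (simplified):

// Simplified sketch of the post-processing read; this is where NoSuchKey is thrown:
const fileResponse = await S3.send(new GetObjectCommand({
  Bucket: metadata.bucket,
  Key: `${metadata.reg}/${metadata.correlationId}.${metadata.extension}`
}));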

Running this code locally with AWS credentials, there are never any problems: all the files are uploaded to S3 and I can continue with my "post-processing".

However, when I deploy it to AWS and run it in the Lambda, a few files now and again fail when I try to pick them up for the post-processing, giving NoSuchKey: The specified key does not exist.

If I check S3 (using the Console), the file it said didn't exist is actually there in the bucket, so it looks like some "timing" issue where the file hasn't been "fully" written to the S3 bucket by the time I try to pick it up...

Am I missing something here? I'm no streams expert, I can tell you that right away, so it might be that I haven't fully understood how to use a stream to write to S3...

Here's a screengrab from CloudWatch showing the problem: the first invocation works just fine, while the second fails. [CloudWatch screenshot]

EDIT (some time later): I've restructured it a bit and updated the code above. It is better now, but larger ZIP archives (about 5,000 files in them) still cause the same issues.

EDIT 2: I had a chat with a knowledgeable coder who suggested adding async/await for the S3 writing, but that doesn't seem to be an option when working with streams, as I then constantly get a weird TypeError: dest.on is not a function at Readable.pipe (node:internal/streams/readable:692:8)... error...
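
As far as I can tell, that error appears because making uploadFromStream() async means the call returns a Promise instead of the PassThrough stream, and Readable.pipe() expects a stream:

// If uploadFromStream() is declared async, this pipes into a Promise:
entry.pipe(uploadFromStream(metadata)); // TypeError: dest.on is not a function

// Awaiting it first gets the stream back, but awaiting the whole S3.send()
// inside the function just stalls, because the upload only finishes after
// data has been piped into the PassThrough:
entry.pipe(await uploadFromStream(metadata));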


Solution

  • After further investigation and logging I can confirm this is a "timing" issue in the AWS SDK v3 for Node with PutObjectCommand.

    I also got this (more or less) confirmed by the AWS Support team for Lambda, who recommended using a HeadObjectCommand before starting to parse the "files", to make sure they exist (a sketch of that approach is at the end of this answer). Kind of a "crappy" solution if you ask me... The AWS S3 team pretty much says "works as designed" and tells me to rewrite my code or not use streams... duh...

    After some Googling I found that the old reliable upload is actually still available in AWS SDK v3, under the lib-storage package, thus:

    const { Upload } = require('@aws-sdk/lib-storage');

    I rebuilt my uploadFromStream() function to use Upload instead, which does NOT suffer from the same "timing issue" that PutObjectCommand does; a rough sketch follows below.

    SO, lesson learned: if you want to stream to S3 using SDK v3, use Upload!
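
    The reworked uploadFromStream() looks roughly like this (not verbatim from my code; the uploads array is my way of keeping the pending promises around so the handler can await them all before the post-processing loop starts):

    const { Upload } = require('@aws-sdk/lib-storage');

    const uploads = []; // pending upload promises, awaited before post-processing

    function uploadFromStream(metadata) {
      const pass = new stream.PassThrough();
      const ContentType = getContentType(metadata.extension);
      const Key = `${metadata.reg}/${metadata.correlationId}.${metadata.extension}`;

      const upload = new Upload({
        client: S3, // same S3Client instance as before
        params: {
          Bucket: metadata.bucket,
          Key,
          ContentType,
          Body: pass // Upload handles streaming bodies, no ContentLength needed
        }
      });

      uploads.push(upload.done()); // resolves once the object is fully written to S3

      return pass; // so entry.pipe(uploadFromStream(metadata)) keeps working
    }

    // ...and after the for await loop, before post-processing:
    // await Promise.all(uploads);

    For completeness, the HeadObjectCommand-style check that AWS Support suggested can also be done with the waiter that ships with the v3 client, something like this inside the (async) post-processing loop:

    const { waitUntilObjectExists } = require('@aws-sdk/client-s3');

    // Blocks until S3 reports the key exists (maxWaitTime is in seconds):
    await waitUntilObjectExists(
      { client: S3, maxWaitTime: 30 },
      { Bucket: metadata.bucket, Key } // the same Key as built in uploadFromStream()
    );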