Tags: amazon-s3, aws-lambda, java-stream, tar, gunzip

Handling Streaming TarArchiveEntry to S3 Bucket from a .tar.gz file


I am using AWS Lambda to decompress and traverse tar.gz files, then upload the entries back to S3 decompressed, retaining the original directory structure.

I am running into an issue streaming a TarArchiveEntry to an S3 bucket via a PutObjectRequest. The first entry is streamed successfully, but when I call getNextTarEntry() on the TarArchiveInputStream, a NullPointerException is thrown because the inflater inside the underlying GzipCompressorInputStream is null; it had an appropriate value prior to the s3Client.putObject(new PutObjectRequest(...)) call.

I have not been able to find documentation on how or why the gzip stream's inflater is being set to null after the stream has been partially sent to S3.

EDIT: Further investigation has revealed that the AWS call appears to be closing the input stream after it finishes uploading the specified content length... I have not been able to find out how to prevent this behavior.

Below is essentially what my code looks like. Thanks in advance for your help, comments, and suggestions.

public String handleRequest(S3Event s3Event, Context context) {

    TarArchiveInputStream tarInput = null;
    try {
        S3Event.S3EventNotificationRecord s3EventRecord = s3Event.getRecords().get(0);
        String bucketName = s3EventRecord.getS3().getBucket().getName();

        // Object key may have spaces or unicode non-ASCII characters.
        String srcKey = s3EventRecord.getS3().getObject().getKey();

        System.out.println("Received valid request from bucket: " + bucketName + " with srckey: " + srcKeyInput);

        String bucketFolder = srcKey.substring(0, srcKey.lastIndexOf('/') + 1);
        System.out.println("File parent directory: " + bucketFolder);

        final AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();

        tarInput = new TarArchiveInputStream(new GzipCompressorInputStream(getObjectContent(s3Client, bucketName, srcKey)));

        TarArchiveEntry currentEntry = tarInput.getNextTarEntry();

        while (currentEntry != null) {
            String fileName = currentEntry.getName();
            System.out.println("For path = " + fileName);

            // checking if looking at a file (vs a directory)
            if (currentEntry.isFile()) {

                System.out.println("Copying " + fileName + " to " + bucketFolder + fileName + " in bucket " + bucketName);
                ObjectMetadata metadata = new ObjectMetadata();
                metadata.setContentLength(currentEntry.getSize());

                s3Client.putObject(new PutObjectRequest(bucketName, bucketFolder + fileName, tarInput, metadata)); // contents are properly and successfully sent to s3
                System.out.println("Done!");
            }

            currentEntry = tarInput.getNextTarEntry(); // NPE here because the underlying gz inflater is null
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(tarInput);
    }
    return "Ok";
}

Solution

  • That's true: the AWS SDK closes an InputStream provided to PutObjectRequest, and I don't know of a way to instruct it not to do so.

    However, you can wrap the TarArchiveInputStream with a CloseShieldInputStream from Commons IO, like this:

    InputStream shieldedInput = new CloseShieldInputStream(tarInput);
    
    s3Client.putObject(new PutObjectRequest(bucketName, bucketFolder + fileName, shieldedInput, metadata));
    

    When AWS closes the provided CloseShieldInputStream, the underlying TarArchiveInputStream will remain open.
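    For completeness, this is roughly what the upload loop from the question looks like with the shield in place. It is only a sketch reusing the question's variable names; CloseShieldInputStream comes from org.apache.commons.io.input:

    TarArchiveEntry currentEntry = tarInput.getNextTarEntry();
    while (currentEntry != null) {
        if (currentEntry.isFile()) {
            ObjectMetadata metadata = new ObjectMetadata();
            metadata.setContentLength(currentEntry.getSize());

            // The shield swallows the close() the SDK issues after the upload,
            // so tarInput stays open for the next getNextTarEntry() call.
            s3Client.putObject(new PutObjectRequest(
                    bucketName,
                    bucketFolder + currentEntry.getName(),
                    new CloseShieldInputStream(tarInput),
                    metadata));
        }
        currentEntry = tarInput.getNextTarEntry();
    }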


    P.S. I don't know what ByteArrayInputStream(tarInput.getCurrentEntry()) does, but it looks very strange, so I ignored it for the purposes of this answer.
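    That said, if you would rather not add Commons IO just for this, one alternative is to buffer each entry into memory first and hand the SDK a throwaway ByteArrayInputStream that it is free to close. The sketch below reuses the question's variable names and assumes each entry fits comfortably in the Lambda's memory:

    // Read the current entry fully; the size comes from the tar header.
    byte[] entryBytes = new byte[(int) currentEntry.getSize()];
    org.apache.commons.compress.utils.IOUtils.readFully(tarInput, entryBytes);

    ObjectMetadata metadata = new ObjectMetadata();
    metadata.setContentLength(entryBytes.length);

    // Closing this stream does not affect tarInput, since it only wraps the buffer.
    s3Client.putObject(new PutObjectRequest(
            bucketName,
            bucketFolder + currentEntry.getName(),
            new java.io.ByteArrayInputStream(entryBytes),
            metadata));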