Tags: java, amazon-s3, buffer, java.util.scanner, multipart

Java multipart upload to S3


My method receives a buffered reader and transforms each line in my file. However, I need to upload the output of this transformation to an S3 bucket. The files are quite large, so I would like to stream my upload into an S3 object.

To do so, I think I need to use a multipart upload. However, I'm not sure I'm using it correctly, as nothing seems to get uploaded.

Here is my method:

public void transform(BufferedReader reader)
{
        Scanner scanner = new Scanner(reader);
        String row;
        List<PartETag> partETags = new ArrayList<>();

        InitiateMultipartUploadRequest request = new InitiateMultipartUploadRequest("output-bucket", "test.log");
        InitiateMultipartUploadResult result = amazonS3.initiateMultipartUpload(request);

        while (scanner.hasNext()) {
            row = scanner.nextLine();

            InputStream inputStream = new ByteArrayInputStream(row.getBytes(Charset.forName("UTF-8")));

            log.info(result.getUploadId());

            UploadPartRequest uploadRequest = new UploadPartRequest()
                    .withBucketName("output-bucket")
                    .withKey("test.log")
                    .withUploadId(result.getUploadId())
                    .withInputStream(inputStream)
                    .withPartNumber(1)
                    .withPartSize(5 * 1024 * 1024);

            partETags.add(amazonS3.uploadPart(uploadRequest).getPartETag());
        }

        log.info(result.getUploadId());

        CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
                "output-bucket",
                "test.log",
                result.getUploadId(),
                partETags);

        amazonS3.completeMultipartUpload(compRequest);
}

Solution

  • Oh, I see. The UploadPartRequest needs to read from an input stream, while your transformation produces its result as output. This is a valid constraint, since in general you can only read from input streams and only write to output streams.

    You have probably heard that you can copy data from an InputStream to a ByteArrayOutputStream, then take the resulting byte array and create a ByteArrayInputStream from it. You could feed this to your request object. BUT: all of the data will be held in one byte array at some point. Since your use case is about large files, this is not acceptable.

    What you need is a custom input stream class that transforms the original input stream into another input stream. This requires you to work at the byte level, but it would offer the best performance. I suggest asking a new question if you would like to know more about that.

    Is your transformation code already finished, and you don't want to touch it again? Then there is another approach. You could also just "connect" an output stream to an input stream by using pipes: https://howtodoinjava.com/java/io/convert-outputstream-to-inputstream-example/. The catch: you are dealing with multi-threading here, because the code writing to the pipe and the code reading from it have to run in separate threads.
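
To make the pipe approach a bit more concrete, here is a minimal sketch using only the JDK's PipedInputStream and PipedOutputStream. The class name PipedTransform, the helper transformToStream, and the 64 KiB pipe size are hypothetical and chosen for illustration; the S3 upload itself is deliberately left out. The transformation runs in a background thread and writes into the pipe, while the caller gets an InputStream that it could pass on to whatever consumes the data (for example via withInputStream on the upload request).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

// Hypothetical helper class, not part of any library.
final class PipedTransform {

    // Returns an InputStream that yields the transformed lines of `reader`.
    // The transformation runs in a separate thread, as pipes require.
    static InputStream transformToStream(BufferedReader reader,
                                         UnaryOperator<String> transform)
            throws IOException {
        // Assumed buffer size; the producer blocks when the pipe is full.
        PipedInputStream in = new PipedInputStream(64 * 1024);
        PipedOutputStream out = new PipedOutputStream(in);

        Thread producer = new Thread(() -> {
            // try-with-resources closes the pipe, which signals end of
            // stream to the reading side.
            try (Writer writer = new BufferedWriter(
                    new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(transform.apply(line));
                    writer.write('\n');
                }
            } catch (IOException e) {
                // Sketch only: real code should report this failure to the
                // consumer so a truncated stream is not treated as complete.
                e.printStackTrace();
            }
        }, "transform-producer");
        producer.setDaemon(true);
        producer.start();

        return in;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader("a\nb\n"));
        try (InputStream in = transformToStream(reader, String::toUpperCase)) {
            // Here the stream is simply read back; in the question it would
            // be consumed by the upload instead.
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

Only a pipe-sized buffer of data is held in memory at any time, which is the point of this approach for large files. Note that readAllBytes in main needs Java 9 or later and is used for the demonstration only; a real consumer would read the stream in chunks.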