
Fetching specific fields from an S3 document


I am using the AWS Java SDK in my application to talk to one of my S3 buckets, which holds objects in JSON format.

A document may look like this:

{
    "a" : dataA,
    "b" : dataB,
    "c" : dataC,
    "d" : dataD,
    "e" : dataE
} 

Now, for a certain document, let's say document1, I need to fetch only the values of fields a and b instead of fetching the entire document.

This sounds like it might not be possible, because S3 buckets can hold any type of object, not just JSON documents.

Is this achievable, though?


Solution

  • That's actually doable with S3 Select. You can run selects like the one you've described, but only on objects in particular formats: JSON, CSV, or Parquet.

    Imagine a data.json file in the so67315601 bucket in eu-central-1:

    {
      "a": "dataA",
      "b": "dataB",
      "c": "dataC",
      "d": "dataD",
      "e": "dataE"
    }
    

    First, learn how to select the fields via the S3 Console. Use "Object Actions" → "Query with S3 Select":



    AWS Java SDK 1.x

    Here is the code to do the select with AWS Java SDK 1.x:

    @ExtendWith(S3.class)
    class SelectTest {
        @AWSClient(endpoint = Endpoint.class)
        private AmazonS3 client;
    
        @Test
        void test() throws IOException {
            // LINES: Each line in the input data contains a single JSON object
            // DOCUMENT: A single JSON object can span multiple lines in the input
            final JSONInput input = new JSONInput();
            input.setType(JSONType.DOCUMENT);
    
            // Configure input format and compression
            final InputSerialization inputSerialization = new InputSerialization();
            inputSerialization.setJson(input);
            inputSerialization.setCompressionType(CompressionType.NONE);
    
            // Configure output format
            final OutputSerialization outputSerialization = new OutputSerialization();
            outputSerialization.setJson(new JSONOutput());
    
            // Build the request
            final SelectObjectContentRequest request = new SelectObjectContentRequest();
            request.setBucketName("so67315601");
            request.setKey("data.json");
            request.setExpression("SELECT s.a, s.b FROM s3object s LIMIT 5");
            request.setExpressionType(ExpressionType.SQL);
            request.setInputSerialization(inputSerialization);
            request.setOutputSerialization(outputSerialization);
    
            // Run the query
            final SelectObjectContentResult result = client.selectObjectContent(request);
    
            // Parse the results
            final InputStream stream = result.getPayload().getRecordsInputStream();
    
            IOUtils.copy(stream, System.out);
        }
    }
    

    The output is:

    {"a":"dataA","b":"dataB"}
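
The query above lists the wanted fields by hand. If the field names are only known at runtime, the expression can be assembled from a list. The following is a minimal sketch, not part of the AWS SDK: the class SelectExpressions and the method selectFields are hypothetical names, and the sketch assumes flat, top-level field names made of letters, digits, and underscores. It rejects anything else rather than trying to quote or escape it.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the AWS SDK.
class SelectExpressions {
    // Assumption: only plain top-level field names are supported.
    private static final Pattern SIMPLE_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    // Builds an expression such as: SELECT s.a, s.b FROM s3object s
    static String selectFields(List<String> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("at least one field is required");
        }
        for (String field : fields) {
            // Refuse unexpected names instead of splicing them into the SQL text
            if (!SIMPLE_NAME.matcher(field).matches()) {
                throw new IllegalArgumentException("unsupported field name: " + field);
            }
        }
        String projection = fields.stream()
            .map(field -> "s." + field)
            .collect(Collectors.joining(", "));
        return "SELECT " + projection + " FROM s3object s";
    }

    public static void main(String[] args) {
        // prints: SELECT s.a, s.b FROM s3object s
        System.out.println(selectFields(List.of("a", "b")));
    }
}
```

The returned string can be passed to request.setExpression(...) in the test above. The LIMIT clause is left out here, since data.json holds a single document.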
    

    AWS Java SDK 2.x

    The code for the AWS Java SDK 2.x is more involved: the results arrive as an event stream that a response handler has to collect. Refer to this ticket for more information.

    @ExtendWith(S3.class)
    class SelectTest {
        @AWSClient(endpoint = Endpoint.class)
        private S3AsyncClient client;
    
        @Test
        void test() throws Exception {
            final InputSerialization inputSerialization = InputSerialization
                .builder()
                .json(JSONInput.builder().type(JSONType.DOCUMENT).build())
                .compressionType(CompressionType.NONE)
                .build();
    
            final OutputSerialization outputSerialization = OutputSerialization.builder()
                .json(JSONOutput.builder().build())
                .build();
    
            final SelectObjectContentRequest select = SelectObjectContentRequest.builder()
                .bucket("so67315601")
                .key("data.json")
                .expression("SELECT s.a, s.b FROM s3object s LIMIT 5")
                .expressionType(ExpressionType.SQL)
                .inputSerialization(inputSerialization)
                .outputSerialization(outputSerialization)
                .build();
            final TestHandler handler = new TestHandler();
    
            client.selectObjectContent(select, handler).get();
    
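            // Take the first RECORDS event; fail clearly instead of with a NullPointerException if none arrived
            RecordsEvent response = (RecordsEvent) handler.receivedEvents.stream()
                .filter(e -> e.sdkEventType() == SelectObjectContentEventStream.EventType.RECORDS)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("No RECORDS event received"));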
    
            System.out.println(response.payload().asUtf8String());
        }
    
        private static class TestHandler implements SelectObjectContentResponseHandler {
            private SelectObjectContentResponse response;
            private List<SelectObjectContentEventStream> receivedEvents = new ArrayList<>();
            private Throwable exception;
    
            @Override
            public void responseReceived(SelectObjectContentResponse response) {
                this.response = response;
            }
    
            @Override
            public void onEventStream(SdkPublisher<SelectObjectContentEventStream> publisher) {
                publisher.subscribe(receivedEvents::add);
            }
    
            @Override
            public void exceptionOccurred(Throwable throwable) {
                exception = throwable;
            }
    
            @Override
            public void complete() {
            }
        }
    }
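
The test above prints only the first RECORDS event, which is enough for a small object like data.json. For larger results, S3 Select may split the output across several RECORDS events, and a split can fall in the middle of a record. The following is a minimal sketch of joining such chunks, using only the JDK. The class RecordChunks and the method join are hypothetical names; in the SDK 2.x code each chunk would presumably come from RecordsEvent.payload().asByteArray().

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the AWS SDK.
class RecordChunks {
    // Joins the raw chunks in arrival order and decodes once at the end,
    // so a multi-byte UTF-8 character split across two chunks stays intact.
    static String join(List<byte[]> chunks) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] chunk : chunks) {
            buffer.write(chunk, 0, chunk.length);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] all = "{\"a\":\"dataA\",\"b\":\"dataB\"}\n".getBytes(StandardCharsets.UTF_8);
        // Simulate a result that arrived in two chunks, split at an arbitrary byte
        byte[] first = Arrays.copyOfRange(all, 0, 10);
        byte[] second = Arrays.copyOfRange(all, 10, all.length);
        System.out.print(join(List.of(first, second)));
    }
}
```

Decoding each chunk separately and concatenating the strings would work for ASCII-only data, but could corrupt a character whose bytes straddle two chunks, which is why the sketch buffers bytes first.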
    

    As you can see, it's possible to run S3 Select queries programmatically!

    You might be wondering what @AWSClient and @ExtendWith(S3.class) are.

    They come from aws-junit5, a small library that injects AWS clients into your tests and can greatly simplify them. I am the author. The usage is really simple; try it in your next project!