
Using AWS S3 Java client to obtain directory and object metadata


Scala 2.11 here, although this concerns the AWS S3 Java client API so it's really a Java question. It would be awesome if someone can provide an answer in Scala, but I'll happily accept any Java answer that works (I can always Scala-ify it on my own time).


I am trying to use the AWS S3 client library to connect to a bucket on S3 which has the following directory structure underneath it:

my-bucket/
    3dj439fj9fj49j/
        data.json
    3eidi04d40d40d/
        data.json
    a874739sjsww93/
        data.json
    ...

Hence every immediate child object under the bucket is a directory with an alphanumeric name. I'll call these the "ID directories". Each ID directory has a single child object named data.json.

I need to accomplish several things:

  1. I need an array/map/datastruct of strings (Java Array<String> or Scala Array[String]) containing all the alphanumeric IDs of the ID directories (so element 0 is "3dj439fj9fj49j", element 1 is "3eidi04d40d40d", etc.); and
  2. I need an array/map/datastruct of dates (Java Array<Date> or Scala Array[Date]) containing the Last Modified timestamp of each ID directory's corresponding data.json file. So if my-bucket/3dj439fj9fj49j/data.json had a Last Modified date/timestamp of, say, 2017-05-29 11:19:24T, then that datetime would be the first element of this second array; and
  3. These two arrays/maps/datastructs need to be associative, meaning I could access, say, index 4 of the first (ID) array and get the 5th ID directory underneath my-bucket, and I could also access index 4 of the second (date) array and get the Last Modified timestamp of the 5th ID directory's data.json child object.

These don't necessarily have to be arrays, they could be maps, tuples, whatever. I just need 1+ data structures to hold this content as described above.

From the lib's Javadocs I see an ObjectMetadata#getLastModified method, but I don't see anything for reading the parent directory path of a given S3Object (meaning the data.json's parent ID directory). All in all, my best attempt is failing pretty spectacularly:

val s3Client = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey))
val bucketRoot : S3Object = s3Client.getObject("myBucket","/")

// TODO: How to query 'bucketRoot' for all its child ID directories?
val idDirs : Array[S3Object] = ???

var dataMap : Map[String,Date] = null
idDirs.foreach(idDir ->
  // TODO: getName() and getChildSomehow() don't exist...obviously
  dataMap :+ idDir.getName() -> idDir.getChildSomehow("data.json").getObjectMetadata.getLastModified
)

Any S3 API gurus out there that can spot where I'm going awry, or nudge me in the right direction here? Thanks in advance!


Solution

  • You can call AmazonS3#listObjects(String) to get a list of objects in the bucket. The response will contain an S3ObjectSummary for each key found. You can call S3ObjectSummary#getLastModified() to get the last modified date/time.

    Here is an example that ties it all together with some Scala code.

    Input from S3 Bucket

    > aws s3 ls --recursive s3://<REDACTED>/
    2017-08-02 13:45:12          0 3dj439fj9fj49j/
    2017-08-02 13:45:28          0 3dj439fj9fj49j/data.json
    2017-08-02 13:45:16          0 3eidi04d40d40d/
    2017-08-02 13:45:33          0 3eidi04d40d40d/data.json
    2017-08-02 13:45:19          0 a874739sjsww93/
    2017-08-02 13:45:37          0 a874739sjsww93/data.json
    

    Code

    import collection.JavaConverters._
    
    import com.amazonaws.auth.AWSStaticCredentialsProvider
    import com.amazonaws.auth.BasicAWSCredentials
    import com.amazonaws.regions.Regions
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    
    val key = <REDACTED>
    val secret = <REDACTED>
    val bucketName = <REDACTED>
    val region = <REDACTED>
    
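    // Build a client from static credentials for the bucket's region.
    val creds = new BasicAWSCredentials(key, secret)
    val s3 = AmazonS3ClientBuilder.standard.withCredentials(new AWSStaticCredentialsProvider(creds)).withRegion(region).build
    
    // List the bucket's keys (a single listObjects response holds at most 1,000 of them).
    val objectSummaries = s3.listObjects(bucketName).getObjectSummaries.asScala
    // Keep only the data.json objects.
    val dataFiles = objectSummaries.filter { _.getKey.endsWith("data.json") }
    // Pair each file's parent path component with its last modified date.
    val dataDirectories = dataFiles.map(dataFile => {
      val keyComponents = dataFile.getKey.split("/")
      // A key with no "/" sits at the bucket root, so report "/" as its parent.
      val parent = if (keyComponents.length > 1) keyComponents(keyComponents.length - 2) else "/"
      (parent, dataFile.getLastModified)
    })
    dataDirectories.foreach(println)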
    

    Output

    (3dj439fj9fj49j,Wed Aug 02 13:45:28 PDT 2017)
    (3eidi04d40d40d,Wed Aug 02 13:45:33 PDT 2017)
    (a874739sjsww93,Wed Aug 02 13:45:37 PDT 2017)
    

    Explanation

    First, we have some bootstrapping to set up credentials and create the client. Then, we issue listObjects, which triggers a call to the S3 service. We filter those results to only keys ending with "data.json". Then, we map the results to tuples consisting of the parent path name and the object's last modified date/time. To determine the parent path, we split on the path separator and retrieve the previous path component. As a special case, if the file is in the root directory, then we say that its parent is "/".
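
The key handling described above can be tried on plain strings, without touching S3. The sketch below pulls it into two helpers; the names `S3Keys`, `parentOf` and `isDataFile` are made up for illustration and are not part of the AWS SDK. `isDataFile` is a slightly stricter variation on the `endsWith("data.json")` filter, under the assumption that you would not want a key such as `x/mydata.json` to match.

```scala
object S3Keys {
  // True when the last path component of the key is exactly "data.json".
  def isDataFile(key: String): Boolean =
    key == "data.json" || key.endsWith("/data.json")

  // Returns the path component immediately before the file name,
  // or "/" when the key has no parent (the object sits at the bucket root).
  def parentOf(key: String): String = {
    val parts = key.split("/")
    if (parts.length > 1) parts(parts.length - 2) else "/"
  }
}
```

With helpers like these, the filter and map steps could be written in terms of `S3Keys.isDataFile(summary.getKey)` and `S3Keys.parentOf(summary.getKey)`.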

    I chose to represent the results as tuples, but you can change this to some other data structure if you prefer.
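
If a map or a pair of index-aligned arrays fits the question's requirements better, the tuples convert directly. Here is a small sketch of both options. The `Associate` object and its sample timestamps are placeholders standing in for the real `(parent, lastModified)` tuples from S3, so treat the values as illustrative only.

```scala
import java.util.Date

object Associate {
  // Stand-in for the (parent, lastModified) tuples built from the S3 listing.
  // The timestamps are arbitrary sample values, not real S3 data.
  val dataDirectories: Seq[(String, Date)] = Seq(
    ("3dj439fj9fj49j", new Date(1000L)),
    ("3eidi04d40d40d", new Date(2000L)),
    ("a874739sjsww93", new Date(3000L))
  )

  // Option 1: look up a timestamp by ID.
  val byId: Map[String, Date] = dataDirectories.toMap

  // Option 2: two parallel arrays; ids(i) and dates(i) describe the same directory.
  val (idSeq, dateSeq) = dataDirectories.unzip
  val ids: Array[String] = idSeq.toArray
  val dates: Array[Date] = dateSeq.toArray
}
```

The map is usually the easier of the two to keep consistent, since the parallel arrays only stay aligned as long as nothing reorders one without the other.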

    Note that for buckets containing a very large number of objects, you might want to use AmazonS3#listObjects(String, String) instead, so that you can restrict the results returned to keys matching a specific prefix. This will cut down the amount of network bandwidth consumed by the response and the amount of processing required on the response data.

    For even more options, you could also consider AmazonS3#listObjects(ListObjectsRequest) or AmazonS3#listObjectsV2(ListObjectsV2Request).
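
One thing to watch with any of these calls: a single listing response holds at most 1,000 keys, so a larger bucket has to be read page by page. Below is a minimal sketch of that loop. It is written against a hypothetical `Page` type and `fetchPage` function (neither is part of the AWS SDK) so that it runs without the SDK on the classpath. The comment at the end shows roughly how `fetchPage` might wrap `listObjectsV2`, assuming the `s3` and `bucketName` values from the code above; that part is untested.

```scala
import scala.annotation.tailrec

object Paging {
  // One page of a listing: the keys it returned, plus the token for the next page if there is one.
  final case class Page(keys: List[String], nextToken: Option[String])

  // Calls fetchPage repeatedly, passing each page's token to the next call,
  // and stops when a page comes back without a token.
  def listAllKeys(fetchPage: Option[String] => Page): List[String] = {
    @tailrec
    def loop(token: Option[String], acc: List[String]): List[String] = {
      val page = fetchPage(token)
      val all = acc ++ page.keys
      page.nextToken match {
        case None => all
        case next => loop(next, all)
      }
    }
    loop(None, Nil)
  }

  // Assumed wiring to the AWS SDK (not compiled here):
  //   def fetchPage(token: Option[String]): Page = {
  //     val request = new ListObjectsV2Request().withBucketName(bucketName)
  //     token.foreach(t => request.setContinuationToken(t))
  //     val result = s3.listObjectsV2(request)
  //     Page(result.getObjectSummaries.asScala.map(_.getKey).toList,
  //          if (result.isTruncated) Some(result.getNextContinuationToken) else None)
  //   }
}
```

If you stay with the original `listObjects` call instead, the equivalent is to check `isTruncated` on the returned `ObjectListing` and keep calling `AmazonS3#listNextBatchOfObjects` until it is false.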