Scala 2.11 here, although this concerns the AWS S3 Java client API so it's really a Java question. It would be awesome if someone can provide an answer in Scala, but I'll happily accept any Java answer that works (I can always Scala-ify it on my own time).
I am trying to use the AWS S3 client library to connect to a bucket on S3 which has the following directory structure underneath it:
my-bucket/
    3dj439fj9fj49j/
        data.json
    3eidi04d40d40d/
        data.json
    a874739sjsww93/
        data.json
    ...
Hence every immediate child object under the bucket is a directory with an alphanumeric name. I'll call these the "ID directories". Each of these ID directories has a single child object named data.json.
I need to accomplish several things:
1. Obtain an array (Java Array<String> or Scala Array[String]) containing all the alphanumeric IDs of the ID directories (so element 0 is "3dj439fj9fj49j", element 1 is "3eidi04d40d40d", etc.); and
2. Obtain an array (Java Array<Date> or Scala Array[Date]) containing the Last Modified timestamp of each ID directory's corresponding data.json file. So if my-bucket/3dj439fj9fj49j/data.json had a Last Modified date/timestamp of, say, 2017-05-29 11:19:24T, then that datetime would be the first element of this second array.
3. The arrays should be ordered such that I could access the 4th element of the first array and get the 5th ID directory under my-bucket, and I could also access the 4th element of the second (date) array and get the Last Modified timestamp of the 5th ID directory's data.json child object.
These don't necessarily have to be arrays; they could be maps, tuples, whatever. I just need one or more data structures to hold this content as described above.
From the lib's Javadocs I see an ObjectMetadata#getLastModified method, but I don't see anything for reading parent directory paths for a given S3Object (meaning the data.json's parent ID directory). All in all, my best attempt is failing pretty spectacularly:
val s3Client = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey))
val bucketRoot : S3Object = s3Client.getObject("myBucket","/")
// TODO: How to query 'bucketRoot' for all its child ID directories?
val idDirs : Array[S3Object] = ???
var dataMap : Map[String,Date] = null
idDirs.foreach(idDir ->
  // TODO: getName() and getChildSomehow() don't exist...obviously
  dataMap :+ idDir.getName() -> idDir.getChildSomehow("data.json").getObjectMetadata.getLastModified
)
Any S3 API gurus out there that can spot where I'm going awry, or nudge me in the right direction here? Thanks in advance!
You can call AmazonS3#listObjects(String) to get a list of objects in the bucket. The returned ObjectListing contains an S3ObjectSummary for each key found. You can call S3ObjectSummary#getLastModified() on each summary to get the last modified date/time.
Here is an example that ties it all together with some Scala code.
> aws s3 ls --recursive s3://<REDACTED>/
2017-08-02 13:45:12 0 3dj439fj9fj49j/
2017-08-02 13:45:28 0 3dj439fj9fj49j/data.json
2017-08-02 13:45:16 0 3eidi04d40d40d/
2017-08-02 13:45:33 0 3eidi04d40d40d/data.json
2017-08-02 13:45:19 0 a874739sjsww93/
2017-08-02 13:45:37 0 a874739sjsww93/data.json
import collection.JavaConverters._
import com.amazonaws.auth.AWSStaticCredentialsProvider
import com.amazonaws.auth.BasicAWSCredentials
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.AmazonS3ClientBuilder
val key = <REDACTED>
val secret = <REDACTED>
val bucketName = <REDACTED>
val region = <REDACTED>
val creds = new BasicAWSCredentials(key, secret)
val s3 = AmazonS3ClientBuilder.standard.withCredentials(new AWSStaticCredentialsProvider(creds)).withRegion(region).build
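// List the keys in the bucket and convert the Java list to a Scala collection.
val objectSummaries = s3.listObjects(bucketName).getObjectSummaries.asScala
// Keep only the data.json objects, dropping the "directory" placeholder keys.
val dataFiles = objectSummaries.filter { _.getKey.endsWith("data.json") }
// Pair each data.json with the name of its parent "directory".
val dataDirectories = dataFiles.map(dataFile => {
  val keyComponents = dataFile.getKey.split("/")
  val parent = if (keyComponents.length > 1) keyComponents(keyComponents.length - 2) else "/"
  (parent, dataFile.getLastModified)
})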
dataDirectories.foreach(println)
(3dj439fj9fj49j,Wed Aug 02 13:45:28 PDT 2017)
(3eidi04d40d40d,Wed Aug 02 13:45:33 PDT 2017)
(a874739sjsww93,Wed Aug 02 13:45:37 PDT 2017)
First, we have some bootstrapping to set up credentials and create the client. Then, we issue listObjects, which triggers a call to the S3 service. We filter those results to only keys ending with "data.json". Then, we map the results to tuples consisting of the parent path name and the object's last modified date/time. To determine the parent path, we split on the path separator and retrieve the previous path component. As a special case, if the file is in the root directory, then we say that its parent is "/".
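The parent lookup can also be pulled out into a small pure function, which makes it easy to check without touching S3. This is only a sketch: the names KeyParsing and parentOf are made up for illustration (they are not part of the AWS SDK), and it assumes keys use "/" as the separator, as in the listing shown earlier.

```scala
// Hypothetical helper, not part of the AWS SDK: given an S3 key, returns the
// name of the "directory" that immediately contains the object.
object KeyParsing {
  def parentOf(key: String): String = {
    val components = key.split("/")
    // A key with no separator sits directly under the bucket root.
    if (components.length > 1) components(components.length - 2) else "/"
  }
}
```

For example, KeyParsing.parentOf("3dj439fj9fj49j/data.json") yields 3dj439fj9fj49j, while a key with no separator, such as data.json, yields "/".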
I chose to represent the results as tuples, but you can change this to some other data structure if you prefer.
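If you do want the index-aligned arrays or the map from the question, the tuples convert directly. The following is a rough sketch rather than a definitive implementation: the Results object and its sample tuples (including the dates) are made up for illustration and stand in for the dataDirectories value computed from S3.

```scala
import java.util.Date

object Results {
  // Stand-in for the (parent, lastModified) tuples; these dates are invented.
  val dataDirectories: Seq[(String, Date)] = Seq(
    ("3eidi04d40d40d", new Date(2000L)),
    ("3dj439fj9fj49j", new Date(1000L))
  )

  // Sort by ID so that both arrays share one stable ordering.
  val sorted: Seq[(String, Date)] = dataDirectories.sortBy(_._1)

  // Index-aligned arrays: ids(i) and lastModified(i) describe the same directory.
  val ids: Array[String] = sorted.map(_._1).toArray
  val lastModified: Array[Date] = sorted.map(_._2).toArray

  // Alternatively, a lookup table keyed by ID.
  val byId: Map[String, Date] = dataDirectories.toMap
}
```

Deriving both arrays from the same sorted sequence is what keeps them aligned; building them from two separate passes over S3 would not guarantee that.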
Note that for buckets containing a very large number of objects, you might want to use AmazonS3#listObjects(String, String) instead, so that you can restrict the results returned to keys matching a specific prefix. This will cut down the amount of network bandwidth consumed by the response and the amount of processing required on the response data.
For even more options, you could also consider AmazonS3#listObjects(ListObjectsRequest) or AmazonS3#listObjectsV2(ListObjectsV2Request).
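One thing to keep in mind with any of these calls is that S3 returns listings a page at a time (up to 1,000 keys per response), so a large bucket needs a loop that keeps requesting pages until the result is no longer truncated. The sketch below shows that loop in isolation. Paging, Page and listAll are hypothetical names, and the fetch function is an assumption that stands in for the actual SDK call.

```scala
import scala.annotation.tailrec

object Paging {
  // One page of results plus the token needed to request the next page, if any.
  final case class Page[A](items: Seq[A], nextToken: Option[String])

  // Calls `fetch` repeatedly, starting with no token, until a page reports
  // that nothing follows it, and returns the items of all pages in order.
  def listAll[A](fetch: Option[String] => Page[A]): Vector[A] = {
    @tailrec
    def loop(token: Option[String], acc: Vector[A]): Vector[A] = {
      val page = fetch(token)
      val all = acc ++ page.items
      if (page.nextToken.isEmpty) all else loop(page.nextToken, all)
    }
    loop(None, Vector.empty)
  }
}
```

With the SDK, fetch would build a ListObjectsV2Request, call setContinuationToken when a token is given, invoke listObjectsV2, and return the object summaries together with getNextContinuationToken whenever isTruncated is true.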