My S3 bucket contains multiple folders (the folder names are in "YYYY-MM-DD HH:MM:SS" format). I want to read the data from the latest folder using a Glue job (Scala). Could you please help with an approach to do this?
Thanks
You will need to write code that lists the contents of the bucket and then determines which objects are the "latest".
Please note that folders do not actually exist in Amazon S3. For example, you could upload a file called 2022-10-29 08:20:33/foo.txt and Amazon S3 will 'magically' create a folder with that date. If the file is then deleted, the folder will 'disappear' (because it never existed).
When using the Create folder button in the S3 management console, a zero-length object is created with a filename (Key) equal to the name of the folder. This forces the folder to 'appear' even though it doesn't exist.
Also, folders don't store a date of their own, so there is no way to determine the "latest folder" from a stored date value.
It is possible to obtain a list of folders by calling the ListObjectsV2 API with Delimiter='/' (list_objects_v2(Delimiter='/', ...) in boto3, or a ListObjectsV2Request with a delimiter in the AWS SDK for Java/Scala). The response then contains a list of CommonPrefixes, which is the equivalent of listing the folder names at that level. Since S3 returns keys (and therefore CommonPrefixes) in lexicographic order, and your folder names sort chronologically, you could obtain all the 'folder' names this way and take the last one as the latest.
Instead of relying on the date in the path, you could list all objects in the bucket and use each object's LastModified timestamp to determine which object is the latest (which might be more reliable than parsing the path name).
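Here is a minimal sketch of that variant under the same assumptions (placeholder bucket name, AWS SDK for Java v2 on the classpath). Note that it lists every object in the bucket, which can be slow for very large buckets:

```scala
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

import scala.collection.JavaConverters._

object LatestFolderByModifiedTime {
  def main(args: Array[String]): Unit = {
    val bucket = "my-bucket" // placeholder: replace with your bucket name
    val s3     = S3Client.create()

    val request = ListObjectsV2Request.builder().bucket(bucket).build()

    // contents() flattens the paginated responses into one sequence of objects.
    val objects = s3.listObjectsV2Paginator(request)
      .contents()
      .asScala
      .toSeq

    // Pick the most recently written object anywhere in the bucket.
    val newest = objects.maxBy(_.lastModified().toEpochMilli)

    // Recover the "folder" part of its key, e.g.
    // "2022-10-29 08:20:33/part-00000.parquet" -> "2022-10-29 08:20:33/"
    val latestPrefix = newest.key().take(newest.key().lastIndexOf('/') + 1)
    println(s"Newest object: ${newest.key()} (folder: $latestPrefix)")
  }
}
```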