amazon-web-services amazon-s3 aws-glue

Read data from latest folder in S3 bucket


My S3 bucket contains multiple folders whose names are in "YYYY-MM-DD HH:MM:SS" format. I want to read the data from the latest folder using a Glue job (Scala). Could you please help with an approach to do this?

Thanks


Solution

  • You will need to write code that lists the contents of the bucket and then determines which objects are the "latest".

    Please note that folders do not actually exist in Amazon S3. For example, you could upload a file called 2022-10-29 08:20:33/foo.txt and Amazon S3 will 'magically' create a folder with that date. If the file is later deleted, the folder will 'magically' disappear (because it never existed).

    When using the Create folder button in the S3 management console, a zero-length object is created with a Key equal to the name of the folder followed by a /. This forces the folder to 'appear' even though it doesn't exist.

    Also, folders don't maintain a date, so there is no ability to determine the "latest folder" based on a stored date value.

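    It is possible to obtain a list of folders by using list_objects_v2(Delimiter='/', ...) (the boto3 name for the ListObjectsV2 API call), in which case a list of CommonPrefixes is returned, which is the equivalent of listing the folder names at that level. You could do this to obtain all 'folder' names, sort them, and take the last one as the latest. This works because zero-padded "YYYY-MM-DD HH:MM:SS" names sort as strings in the same order as the timestamps they represent.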

    Instead of relying on the date in the path, you could list all objects in the bucket and use LastModified to determine which object is the latest (which might be more reliable than using the path name).
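
The two approaches above could be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation. It uses Python with boto3 because list_objects_v2 is the boto3 method name; the function names are hypothetical, and it assumes boto3 is installed, credentials are configured, and the folders sit at the top level of the bucket.

```python
from datetime import datetime

FOLDER_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches "YYYY-MM-DD HH:MM:SS"


def _pages(bucket, **kwargs):
    """Yield list_objects_v2 result pages (hypothetical helper)."""
    import boto3  # imported lazily so the pure helpers below run without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return paginator.paginate(Bucket=bucket, **kwargs)


def list_folder_prefixes(bucket):
    """Return the top-level 'folder' names, e.g. '2022-10-29 08:20:33/'."""
    prefixes = []
    for page in _pages(bucket, Delimiter="/"):
        prefixes.extend(p["Prefix"] for p in page.get("CommonPrefixes", []))
    return prefixes


def list_objects(bucket):
    """Return every object entry (dicts with 'Key', 'LastModified', ...)."""
    objects = []
    for page in _pages(bucket):
        objects.extend(page.get("Contents", []))
    return objects


def latest_prefix_by_name(prefixes):
    """Approach 1: parse the timestamp in each folder name, keep the newest."""
    def parse(prefix):
        name = prefix.rstrip("/").rsplit("/", 1)[-1]
        return datetime.strptime(name, FOLDER_FORMAT)

    return max(prefixes, key=parse)


def latest_prefix_by_last_modified(objects):
    """Approach 2: find the newest object and return its top-level folder."""
    # Ignore objects stored at the bucket root, which belong to no folder.
    in_folders = [o for o in objects if "/" in o["Key"]]
    newest = max(in_folders, key=lambda o: o["LastModified"])
    return newest["Key"].split("/", 1)[0] + "/"
```

With a placeholder bucket name, latest_prefix_by_name(list_folder_prefixes("my-bucket")) would give the folder to read, and latest_prefix_by_last_modified(list_objects("my-bucket")) gives the LastModified-based alternative. Both helpers raise ValueError when there is nothing to choose from. In a Scala Glue job the same logic should carry over, since the AWS SDK for Java exposes the same ListObjectsV2 operation with a delimiter and common prefixes.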