
PySpark: list the subfolders of a folder in an S3 bucket


I have an S3 bucket in which I store data files that are to be processed by my PySpark code. The folder I want to access is:

s3a://bucket_name/data/

This folder contains subfolders. My aim is to access the contents of the most recently added folder in this directory. I don't want to use boto, for various reasons. Is there any way to access the folder list so I can pick the folder I'm supposed to access? I can access files if I specify the folder, but I want to make it dynamic.


Solution

  • I recommend using s3fs, which is a filesystem-style wrapper on boto3. The docs are here: http://s3fs.readthedocs.io/en/latest/

    Here's the part you care about (you may have to pass in or otherwise configure your AWS credentials):

    import s3fs

    # anon=True is for public buckets; for private data, drop it so s3fs
    # picks up your configured AWS credentials instead.
    fs = s3fs.S3FileSystem(anon=True)
    fs.ls('my-bucket')
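
    To cover the "most recently added folder" part of the question, here's a minimal sketch built on the same s3fs calls. The `bucket_name/data` prefix comes from the question; the credential setup, the assumption that folder names sort chronologically, and the final `spark.read.parquet` call are all assumptions to adapt to your setup:

    import s3fs

    # Assumes AWS credentials are already configured for s3fs
    # (environment variables, ~/.aws/credentials, an IAM role, ...).
    fs = s3fs.S3FileSystem()

    # List the immediate children of the data/ prefix and keep only the
    # "directories" (S3 has no real folders; these are just key prefixes).
    entries = fs.ls('bucket_name/data', detail=True)
    folders = [e['name'] for e in entries if e['type'] == 'directory']

    # Prefixes carry no timestamp of their own, so "last added" here
    # assumes the folder names sort chronologically (e.g. date-stamped
    # names); adjust the key function if your naming scheme differs.
    latest = max(folders)

    # Hand the chosen prefix back to Spark via the s3a:// scheme.
    path = 's3a://' + latest + '/'
    df = spark.read.parquet(path)  # assumes an existing `spark` session and Parquet data

    If your folder names don't sort chronologically, one alternative is to look at the `LastModified` of the objects inside each prefix and pick the prefix with the newest object, at the cost of extra listing calls.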