apache-spark amazon-s3 amazon-emr hadoop-partitioning

How to prevent bucket creation if the bucket does not exist in Spark on EMR


I'm running a Spark step on an EMR cluster. It gathers all the small files and accumulates them into one big file. I receive a list of buckets to process, but before processing a bucket I want to check whether it exists and whether it contains any files. For that purpose I'm using the Hadoop FileSystem.

     String bucketPath = "s3n://" + bucketName;
     Configuration hadoopConfiguration =
             sparkSession.sparkContext().hadoopConfiguration();
     FileSystem.get(new URI(bucketPath), hadoopConfiguration);

The issue is that FileSystem.get(...) creates the bucket if it does not exist. Is it possible to prevent bucket creation? Or does somebody know another way to check for existence?


Solution

  • The best way to disable this is to set the "fs.s3.buckets.create.enabled" Hadoop config to false. This feature is going to be disabled in newer versions of EMR in the near future, both to prevent accidental creation of S3 buckets and to improve startup performance.
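
As a rough sketch, one way to apply that setting to a Spark job is through Spark's "spark.hadoop." prefix, which copies the property into the Hadoop configuration returned by sparkContext().hadoopConfiguration(). This assumes the property accepts the boolean value false and that your EMR release reads it from that configuration; check the release notes for your version before relying on it.

     # spark-defaults.conf (assumed location; the same key/value pair can be
     # passed on the command line with spark-submit --conf key=value)
     spark.hadoop.fs.s3.buckets.create.enabled  false

The property can probably also be set in code with hadoopConfiguration.set("fs.s3.buckets.create.enabled", "false"), as long as that happens before the first FileSystem.get(...) call for the bucket, since Hadoop caches FileSystem instances.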