scala, apache-spark, emr

How can I check if an S3 path exists or not in Spark [using Scala]?


I'm looking for a cleaner way to check whether an S3 path is empty or not.

My current code looks like this:

    if (!s3Path.isEmpty) {
      try {
        val rdd = sc.textFile(s3Path)
        rdd.partitions.size
      } catch {
        case _: org.apache.hadoop.mapred.InvalidInputException =>
          sc.parallelize(List())
      }
    }

I want to do it without creating an RDD.


Solution

  • I check the S3 path to see if it's valid, and only then pass it to Spark to create the RDD, like below:

     public boolean checkIfS3PathsValid(String bucketName, String key)
     {
         try {
             // List objects under the given prefix; a non-empty listing
             // means there is data at this path.
             ObjectListing list = s3.listObjects(bucketName, key);
             List<S3ObjectSummary> objectInfoList = list.getObjectSummaries();
             return !objectInfoList.isEmpty();
         }
         catch (Exception e)
         {
             e.printStackTrace();
             return false;
         }
     }
    

    Here `s3` is a `com.amazonaws.services.s3.AmazonS3` client, which you initialise with:

    s3 = new AmazonS3Client(new PropertiesCredentials(new File("path of your s3 credential file")));
    

    So in your code, call `checkIfS3PathsValid` and see whether it returns true. If it does, create the RDD with `sc.textFile`; otherwise, ignore that S3 path.
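
Alternatively, since the question asks for Scala and Spark already ships with the Hadoop filesystem API, the same check can be sketched without pulling in the AWS SDK directly. This is only a sketch: the helper name `s3PathExists` is mine, and it assumes the S3 filesystem connector for your scheme (`s3://`, `s3n://`, or `s3a://`) is on the classpath, as it is on EMR.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: returns true only if the path exists and lists at
// least one entry, without ever materialising an RDD.
def s3PathExists(pathStr: String, conf: Configuration): Boolean = {
  try {
    val fs = FileSystem.get(new URI(pathStr), conf)
    val path = new Path(pathStr)
    fs.exists(path) && fs.listStatus(path).nonEmpty
  } catch {
    case e: Exception =>
      e.printStackTrace()
      false
  }
}
```

In Spark you would pass `sc.hadoopConfiguration` as `conf`, and only call `sc.textFile(s3Path)` when the helper returns true.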