Tags: amazon-web-services, amazon-s3, filesystems, hdfs

IllegalArgumentException, Wrong FS when specifying input/output from s3 instead of hdfs


I have been running my Spark job on a local cluster that has HDFS, from which the input is read and to which the output is written. Now I have set up AWS EMR and an S3 bucket that holds my input, and I want the output to be written to S3 as well.

The error:

User class threw exception: java.lang.IllegalArgumentException: Wrong FS: s3://something/input, expected: hdfs://ip-some-numbers.eu-west-1.compute.internal:8020

I searched for this error and found several questions about it. Some suggested that it only concerns the output, but I get the same error even when I disable the output.

Another suggestion is that there is something wrong with the FileSystem usage in my code. Here are all the occurrences of input/output handling in my program:

The first occurrence is in my custom FileInputFormat, in getSplits(JobContext job), which I have not actually modified myself but could:

FileSystem fs = path.getFileSystem(job.getConfiguration());

A similar case is in my custom RecordReader, which I also have not modified:

final FileSystem fs = file.getFileSystem(job);

In nextKeyValue() of my custom RecordReader, which I have written myself, I use:

FileSystem fs = FileSystem.get(jc);

And finally, when I want to count the number of files in a folder, I use:

val fs = FileSystem.get(sc.hadoopConfiguration)
val status = fs.listStatus(new Path(path))

I assume the issue is with my code, but how can I modify the FileSystem calls to support input/output from S3?


Solution

  • The core Hadoop FileSystem APIs do not support S3 out of the box. There are two Hadoop FileSystem implementations for S3: S3A and S3N, and S3A is the preferred one. To use it you have to do a few things:

    1. Add the hadoop-aws JAR (which contains the S3A implementation) and the aws-java-sdk-bundle.jar to your classpath.
    2. When you create the FileSystem, include values for the following properties in its configuration (a minimal sketch follows this list):

      fs.s3a.access.key
      fs.s3a.secret.key
      
    3. When specifying paths on S3, don't use s3://; use s3a:// instead.
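
    For example, the properties from step 2 can be set on the Spark job's Hadoop configuration. This is a minimal sketch, assuming a SparkContext named sc and placeholder credential values:

      // Hypothetical credentials; substitute your own access key and secret key.
      sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
      sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")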

    Note: create a simple IAM user and try things out with basic authentication first, as in the sketch above. It is possible to get S3A working with AWS's more advanced temporary-credential mechanisms, but it's a bit involved, and I had to make some changes to the FileSystem code to get it working when I tried.

    Source of info is here
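
    Regarding the FileSystem calls shown in the question: FileSystem.get(conf) always returns the cluster's default filesystem (HDFS on EMR), which is why handing it an s3:// path raises "Wrong FS". Path.getFileSystem(conf) resolves the implementation from the path's scheme instead. A minimal sketch, assuming the S3A properties above are set and using a placeholder bucket name:

      import org.apache.hadoop.fs.{FileSystem, Path}

      // An s3a:// path resolves to the S3A filesystem; an hdfs:// path
      // would resolve to HDFS, so the same code works for both.
      val inputPath = new Path("s3a://some-bucket/input")
      val fs: FileSystem = inputPath.getFileSystem(sc.hadoopConfiguration)
      val status = fs.listStatus(inputPath)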