amazon-s3 emr amazon-emr

Using data present in S3 inside EMR mappers


During the map stage I need to read some data from a static file.

I have uploaded the data file to S3.

How can I access that data while my job is running in EMR?
If I just specify the file path in the code as:

s3n://<bucket-name>/path

will that work?

Thanks


Solution

  • What I ended up doing:

    1) Wrote a small script that copies my file from S3 to the cluster:

    hadoop fs -copyToLocal s3n://$SOURCE_S3_BUCKET/path/file.txt $DESTINATION_DIR_ON_HOST

    2) Created a bootstrap step for my EMR job that runs the script from 1).

    This approach doesn't require making the S3 data public.
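
    For reference, the copy script from 1) could look roughly like the sketch below. This is not the exact script I used: the function name copy_from_s3 and the HADOOP_CMD override are made up for illustration, and the mkdir -p line assumes the destination directory may not exist yet on the host.

```shell
#!/bin/bash
# Sketch of the copy script from step 1).
# copy_from_s3 and HADOOP_CMD are hypothetical names, not part of EMR or Hadoop.

# copy_from_s3 SRC_URI DEST_DIR
# Copies a single S3 object into a local directory using the Hadoop
# filesystem shell (the same "hadoop fs -copyToLocal" call shown above).
copy_from_s3() {
    local src="$1"
    local dest_dir="$2"
    # Allow the hadoop binary to be overridden; default to "hadoop" on the PATH.
    local hadoop_cmd="${HADOOP_CMD:-hadoop}"

    # Assumption: the destination directory may be missing, so create it first.
    mkdir -p "$dest_dir" || return 1

    "$hadoop_cmd" fs -copyToLocal "$src" "$dest_dir"
}
```

    The script would then end with a call such as copy_from_s3 "s3n://$SOURCE_S3_BUCKET/path/file.txt" "$DESTINATION_DIR_ON_HOST", after which the mappers can open the file from the local directory like any other local file.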