apache-pig emr

What is the path for a bootstrapped file in a Pig job running on Amazon EMR?


I bootstrap a data file in my EMR job. The bootstrapping succeeds and the file is copied to the /home/hadoop/contents/ folder with the right permissions.

However, when I try to access it in the Pig script like this:

userdidstopick = load '/home/hadoop/contents/UserIdsToPick.txt' AS (uid:chararray); 

I get an error that the input path does not exist:

 hdfs://10.183.166.176:9000/home/hadoop/contents/UserIdsToPick.txt

When running Ruby jobs, the bootstrapped file was always accessible under the /home/hadoop/contents/ folder and everything worked for me.

Is it different for Pig?


Solution

  • By default, Pig on EMR resolves a path that has no scheme against HDFS, not the local filesystem. That is why the error shows an hdfs:// location even though the file was bootstrapped onto the local disk.

    There are two ways to solve this:

    1. Either copy the file to S3 and load it directly from S3:

      userdidstopick = load 's3_bucket_location/UserIdsToPick.txt' AS (uid:chararray);

    2. Or first copy the file to HDFS (instead of leaving it only on the local filesystem), and then load it with the same path you use today.

    I would prefer the first option.
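
    For the second option, a minimal sketch is shown below. It assumes the Pig script runs on a node where the bootstrapped file exists on local disk, and it reuses the same path in HDFS purely for convenience. It relies on Pig's fs command, which passes its arguments to the Hadoop filesystem shell, and has not been tested on EMR.

    ```pig
    -- Create the target directory in HDFS.
    -- Assumption: a Hadoop version where -mkdir accepts -p; older releases
    -- create parent directories implicitly and may reject the flag.
    fs -mkdir -p /home/hadoop/contents

    -- Copy the bootstrapped file from the local disk into HDFS.
    -- First argument is the local path, second is the HDFS destination.
    fs -copyFromLocal /home/hadoop/contents/UserIdsToPick.txt /home/hadoop/contents/UserIdsToPick.txt

    -- Optional check: list the HDFS directory to confirm the file is there.
    fs -ls /home/hadoop/contents

    -- The original load statement now resolves, because the path exists in HDFS.
    userdidstopick = load '/home/hadoop/contents/UserIdsToPick.txt' AS (uid:chararray);
    ```

    The copy fails if the destination file already exists, so on a rerun the HDFS copy would need to be removed first.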