
Running Pig Script on EMR


So I am using the following file as input: https://svn.apache.org/repos/asf/pig/trunk/tutorial/data/excite-small.log

and the code I have right now is

-- FileName: excite-small.log
log  = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';

I run this job on EMR using the steps mentioned at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-launch.html

I have set the following parameters:

1. For Script Location: s3://mybucket/test.pig
2. For Input Location:  s3://mybucket/excite-small.log
3. For Output Location: s3://mybucket/
4. Arguments: Blank

When I run this job, I get an "Input path does not exist" error. I think this has something to do with REGISTER, but I am not really sure. Could anyone suggest what I am doing wrong?


Solution

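  • In your Pig script, refer to the input file by its full S3 path. A relative path such as 'excite-small.log' is resolved against the cluster's default file system rather than your bucket, which is why the input path is not found; REGISTER is not involved. For example: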

    log  = LOAD 's3://mybucket/excite-small.log' AS (user, timestamp, query);
    

    Or, use the passed-in INPUT path:

    log = LOAD '$INPUT' AS (user, timestamp, query);
    

    Found a good explanation here:
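
Putting the second option together with the script from the question, a minimal sketch might look like the following. It assumes the EMR Pig step passes the configured Input and Output locations to the script as the parameters INPUT and OUTPUT (that is, as `-p INPUT=... -p OUTPUT=...`); if your step is set up differently, substitute the full S3 paths instead.

```pig
-- test.pig: count queries per user in the Excite log.
-- Assumption: EMR supplies $INPUT and $OUTPUT from the step's
-- Input Location and Output Location fields.
log  = LOAD '$INPUT' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- The output path must not already exist, or the job will fail.
STORE cntd INTO '$OUTPUT';
```

Note that Pig refuses to write to an output location that already exists, so an Output Location of s3://mybucket/ (the bucket root) is likely to cause a second error. Pointing it at a prefix that does not exist yet, for example a hypothetical s3://mybucket/output/, should avoid that.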