I am using the following file as input: https://svn.apache.org/repos/asf/pig/trunk/tutorial/data/excite-small.log
The code I have right now is:
-- FileName: excite-small.log
log = LOAD 'excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';
I run this job on EMR using the steps mentioned at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig-launch.html
I have set the following parameters:
1. For Script Location: s3://mybucket/test.pig
2. For Input Location: s3://mybucket/excite-small.log
3. For Output Location: s3://mybucket/
4. Arguments: Blank
When I run this job, I get an "Input path does not exist" error. I think this has something to do with REGISTER, but I am not really sure. Could anyone suggest what I am doing wrong?
In your Pig script, refer to the input file by its full path, e.g.:
log = LOAD 's3://mybucket/excite-small.log' AS (user, timestamp, query);
Or, use the passed-in INPUT path:
log = LOAD '$INPUT' AS (user, timestamp, query);
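Putting the second option together, the whole script might look like the sketch below. It assumes the EMR Pig step hands the configured input and output locations to the script as the Pig parameters INPUT and OUTPUT, and that the output location points at a folder that does not exist yet (for example s3://mybucket/output/, a made-up name) rather than at the bucket root, since Hadoop jobs normally refuse to write into an existing output directory.

```pig
-- Sketch only: assumes EMR supplies the INPUT and OUTPUT parameters
-- from the step's input and output locations.
log  = LOAD '$INPUT' AS (user, timestamp, query);

-- Count the number of queries per user.
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);

-- Write to the passed-in output location instead of a relative path.
STORE cntd INTO '$OUTPUT';
```

A relative path such as 'excite-small.log' or 'output' is resolved against the cluster's default file system rather than your S3 bucket, which would explain the "Input path does not exist" error.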
Found a good explanation here: