Tags: hadoop, apache-pig, elastic-map-reduce

POST Hadoop Pig output to a URL as JSON data?


I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.

Some notes:

  • This job is running on Amazon Elastic MapReduce.
  • I can use a STREAM to pipe the data through an external command, and load it from there. But because Pig never sends an EOF to external commands, I have to POST each row as it arrives and can't batch them, which hurts performance.

What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!


Solution

  • As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn't working, then you probably have a hard-to-debug configuration problem.

    Here's how to get started. Define an external command as follows, using whatever interpreter and script you need:

    DEFINE UPLOAD_RESULTS `env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb`;
    

    Stream the results through your script:

    /* Write our results to our Ruby script for uploading.  We add
       a trailing bogus DUMP to make sure something actually gets run. */
    empty = STREAM results THROUGH UPLOAD_RESULTS;
    DUMP empty;
    

    From Ruby, you can batch the input records into blocks of 1024:

    STDIN.each_line.each_slice(1024) do |chunk|
      # 'chunk' is an array of up to 1024 lines (the final chunk may be
      # shorter), each consisting of tab-separated fields followed by a
      # newline.
    end
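
    To fill in the body of that loop, here is a minimal sketch of what upload_results.rb might contain. It is an illustration under assumptions, not a tested production script: the UPLOAD_URL environment variable and the helper names rows_to_json and post_batch are hypothetical, and the payload shape (a JSON array of arrays of strings) is just one reasonable choice, since the question doesn't say what the receiving endpoint expects.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Turn raw tab-separated lines from Pig into a JSON array of arrays.
# The -1 limit keeps trailing empty fields instead of dropping them.
def rows_to_json(lines)
  rows = lines.map { |line| line.chomp.split("\t", -1) }
  JSON.generate(rows)
end

# POST one batch of lines as a single JSON payload.
# Raises an exception if the server does not answer with a 2xx status.
def post_batch(uri, lines)
  request = Net::HTTP::Post.new(uri.request_uri,
                                'Content-Type' => 'application/json')
  request.body = rows_to_json(lines)
  response = Net::HTTP.start(uri.host, uri.port,
                             :use_ssl => uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.value # Net::HTTPResponse#value raises unless the status is 2xx
  response
end

# UPLOAD_URL is a hypothetical variable name; nothing runs if it is unset.
if ENV['UPLOAD_URL']
  uri = URI.parse(ENV['UPLOAD_URL'])
  STDIN.each_line.each_slice(1024) { |chunk| post_batch(uri, chunk) }
end
```

    Under these assumptions, the URL could be passed in the same way as GEM_PATH, by adding UPLOAD_URL=... after env in the DEFINE statement. Letting a failed POST raise makes the script exit with a non-zero status instead of silently dropping a batch.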
    

    If this fails to work, check the following carefully:

    1. Does your script work from the command line?
    2. When run from Pig, does your script have all the necessary environment variables?
    3. Are your EC2 bootstrap actions working correctly?

    Some of these are hard to verify, but if any of them are failing, you can easily waste quite a lot of time debugging.
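
    For the second check, one low-cost option is to have the script report missing variables before doing any work. The sketch below is only an illustration: the variable list is an assumption (GEM_PATH comes from the DEFINE statement above, UPLOAD_URL is a hypothetical name), and it assumes that whatever the script writes to STDERR ends up somewhere you can read it, such as the Hadoop task logs.

```ruby
# Variables this script is assumed to need; adjust the list to your setup.
REQUIRED_VARS = %w[GEM_PATH UPLOAD_URL]

# Return the names in 'required' that are unset or empty in 'env'.
def missing_vars(env, required)
  required.select { |name| env[name].to_s.empty? }
end

missing = missing_vars(ENV, REQUIRED_VARS)
unless missing.empty?
  STDERR.puts "missing environment variables: #{missing.join(', ')}"
end
```

    Running the same script by hand on a cluster node and comparing the output with what appears under Pig gives a quick hint as to whether the environment differs between the two.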

    Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.