I have a Pig job which analyzes log files and writes summary output to S3. Instead of writing the output to S3, I want to convert it to a JSON payload and POST it to a URL.
What's the best way to address this problem? Is there something in PiggyBank or another library that I can use? Or should I write a new storage adapter? Thank you for your advice!
As it turns out, Pig does correctly send EOF to external commands, so you do have the option of streaming everything through an external script. If it isn't working, then you probably have a hard-to-debug configuration problem.
Here's how to get started. Define an external command as follows, using whatever interpreter and script you need:
DEFINE UPLOAD_RESULTS `env GEM_PATH=/usr/lib/ruby/gems/1.9.0 ruby1.9 /home/hadoop/upload_results.rb`;
Stream the results through your script:
/* Write our results to our Ruby script for uploading. We add
a trailing bogus DUMP to make sure something actually gets run. */
empty = STREAM results THROUGH UPLOAD_RESULTS;
DUMP empty;
From Ruby, you can batch the input records into blocks of 1024:
STDIN.each_line.each_slice(1024) do |chunk|
  # 'chunk' is an array of up to 1024 lines, each consisting of
  # tab-separated fields followed by a newline.
end
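To actually ship each batch as JSON, you can fill in that block along these lines. This is only a minimal sketch using Ruby's standard json and net/http libraries; the http://example.com/summaries endpoint and the per-line field handling are placeholders you would replace with your own URL and schema:

require 'json'
require 'net/http'
require 'uri'

# Hypothetical endpoint; replace with the URL you actually POST to.
ENDPOINT = URI.parse('http://example.com/summaries')

STDIN.each_line.each_slice(1024) do |chunk|
  # Split each tab-separated line into an array of fields.
  records = chunk.map { |line| line.chomp.split("\t") }

  request = Net::HTTP::Post.new(ENDPOINT.path, 'Content-Type' => 'application/json')
  request.body = JSON.generate(records)

  # One POST per batch of up to 1024 records.
  Net::HTTP.start(ENDPOINT.host, ENDPOINT.port) do |http|
    http.request(request)
  end
end

Because the script writes nothing to STDOUT, the 'empty' relation in the Pig script stays empty, which is exactly what the trailing DUMP expects.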
If this fails to work, check the following carefully:
Some of these are hard to verify, but if any of them are failing, you can easily waste quite a lot of time debugging.
Note, however, that you should strongly consider the alternative approaches recommended by mat kelcey.