hadoop, amazon-s3, apache-pig, elastic-map-reduce, amazon-emr

Hadoop Pig save each line of a file to S3


Currently, I have a Pig script running on Amazon EMR that loads a bunch of files from S3, filters them, and groups the data by phone number, so the schema looks like (phonenumber:chararray, bag:{mydata:chararray}). Next I have to store the data for each phone number into a different S3 bucket (possibly buckets in different accounts that I have access to). org.apache.pig.piggybank.storage.MultiStorage seems like the best fit here, but it doesn't work for me, as there are two problems I am facing:

  1. There are a lot of phone numbers (approximately 20,000); storing each phone number into a different S3 bucket is very slow, and the program even runs out of memory.
  2. There is no way for me to consult my lookup table to decide which bucket to store each phone number into.

So I am wondering if anyone can help out. The second problem can probably be solved by writing my own UDF store function, but how do I solve the first one? Thanks.
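
For reference, here is roughly the shape of the script described above; the input path, output bucket, field names, and piggybank jar location are placeholders rather than the real values:

    -- piggybank jar location varies by EMR/Pig version; adjust as needed
    REGISTER /home/hadoop/pig/piggybank.jar;

    -- Load the raw records from S3 (path and schema are illustrative)
    raw = LOAD 's3://my-input-bucket/data/*' USING PigStorage('\t')
          AS (phonenumber:chararray, mydata:chararray);

    -- Filter, then group by phone number to get (phonenumber, {(mydata), ...})
    filtered = FILTER raw BY phonenumber IS NOT NULL;
    grouped  = GROUP filtered BY phonenumber;
    bagged   = FOREACH grouped GENERATE group AS phonenumber, filtered.mydata AS mydata;

    -- MultiStorage writes one output location per distinct value of field 0
    -- (the phone number) under the given parent path
    STORE bagged INTO 's3://my-output-bucket/by-phone'
        USING org.apache.pig.piggybank.storage.MultiStorage(
            's3://my-output-bucket/by-phone', '0', 'none', '\t');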


Solution

  • S3 is limited to 100 buckets per account. On top of that, creating a bucket is not immediate, as you need to wait for the bucket to become ready.

    However, you can have as many objects as you want in a bucket, and you can write each phone number as a separate object relatively quickly, especially if you take care with the names of your objects: objects in S3 are partitioned by key prefix. If you give all your objects the same prefix, S3 will try to put all of them in the same "hot" area and you will get lower performance. If you make the prefixes differ (usually by simply reversing the id or the timestamp), you will improve throughput significantly; see the sketch at the end of this answer.

    You can also take a look at DynamoDB, a scalable NoSQL database in AWS. You can provision very high throughput for the time it takes to build your index, and you can later export the table to S3 using Hive on Elastic MapReduce.
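
    A minimal sketch of the single-bucket approach, assuming the grouped (phonenumber, bag-of-mydata) relation from the question; the bucket name, the piggybank jar path, and the choice of MultiStorage plus the piggybank Reverse UDF are illustrative, not a prescribed implementation:

        -- piggybank jar location varies by EMR/Pig version; adjust as needed
        REGISTER /home/hadoop/pig/piggybank.jar;

        -- 'bagged' is assumed to be the grouped (phonenumber, bag) relation.
        -- Reversing the phone number makes the object key prefixes differ,
        -- spreading writes across S3 partitions instead of one "hot" prefix.
        keyed = FOREACH bagged GENERATE
                    org.apache.pig.piggybank.evaluation.string.Reverse(phonenumber) AS revphone,
                    phonenumber,
                    mydata;

        -- One output bucket; MultiStorage splits on field 0 (revphone), so each
        -- phone number's data lands under its own reversed-number prefix.
        STORE keyed INTO 's3://my-single-output-bucket/by-phone'
            USING org.apache.pig.piggybank.storage.MultiStorage(
                's3://my-single-output-bucket/by-phone', '0', 'none', '\t');

    Note that MultiStorage still holds one writer per distinct key, which is likely what caused the out-of-memory issue with ~20,000 phone numbers, so a custom StoreFunc may still be needed; the sketch is only meant to show the one-bucket, varied-prefix layout.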