Currently I have a Pig script running on Amazon EMR that loads a bunch of files from S3, filters them, and groups the data by phone number, so the resulting relation looks like (phonenumber:chararray, bag:{mydata:chararray}).
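Roughly, the load/filter/group part looks like this (the input path, delimiter, and field names are placeholders, not my actual data):

    raw     = LOAD 's3://my-input-bucket/input/*' USING PigStorage('\t')
                  AS (phonenumber:chararray, mydata:chararray);
    -- placeholder filter; my real script does more processing here
    clean   = FILTER raw BY phonenumber IS NOT NULL;
    grouped = GROUP clean BY phonenumber;
    -- grouped: (group:chararray, clean:{(phonenumber:chararray, mydata:chararray)})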
Next I have to store the data for each phone number into a different S3 bucket (possibly buckets in different accounts that I have access to). org.apache.pig.piggybank.storage.MultiStorage seems like the best fit here, but it doesn't work, as there are 2 problems I am facing:
So I am wondering if anyone can help out? The second one can probably be solved by writing my own UDF store function, but how do I solve the first one? Thanks.
S3 is limited to 100 buckets per account. Moreover, creating a bucket is not immediate: you have to wait for the bucket to become ready.
However, you can have as many objects as you want in a single bucket, and you can write the phone numbers as separate objects relatively quickly, especially if you take care with your object names: objects in S3 are stored by prefix. If you give all your objects the same prefix, S3 will try to put all of them on the same "hot" partition and you will get lower performance. If you make the prefixes different (usually by simply reversing the ID or timestamp), you will improve throughput significantly.
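For example, sticking with the Pig script above, you can keep a single output bucket and use the reversed phone number as the key prefix. This is only a sketch: the aliases follow the snippet in the question, the bucket name is a placeholder, Reverse is the piggybank string UDF, and the piggybank jar path varies by EMR version:

    REGISTER /home/hadoop/lib/pig/piggybank.jar;
    DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

    -- reverse the phone number so keys spread across S3 partitions,
    -- e.g. 4155551234 becomes the prefix 4321555514
    keyed = FOREACH grouped GENERATE Reverse(group) AS s3key, clean;

    -- one bucket, one sub-directory (key prefix) per reversed phone number
    STORE keyed INTO 's3://my-output-bucket/by-phone'
        USING org.apache.pig.piggybank.storage.MultiStorage('s3://my-output-bucket/by-phone', '0');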
You can also take a look at DynamoDB, a scalable NoSQL database in AWS. You can get very high throughput while building your index, and you can later export the table to S3 using Hive over Elastic MapReduce.
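The export step would look roughly like this in Hive on EMR (the table name, attribute names, and S3 path are placeholders):

    -- map an external Hive table onto the DynamoDB table, then dump it to S3
    CREATE EXTERNAL TABLE phone_data (phonenumber string, mydata string)
    STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
    TBLPROPERTIES (
        "dynamodb.table.name"     = "PhoneData",
        "dynamodb.column.mapping" = "phonenumber:phonenumber,mydata:mydata"
    );

    INSERT OVERWRITE DIRECTORY 's3://my-export-bucket/phone-export/'
    SELECT * FROM phone_data;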