json · hadoop · amazon-s3 · emr

Recommended format for loading data into Hadoop for simple MapReduce


Currently I'm loading a pretty massive amount of data into our Redshift clusters from S3 (10k rows per second or so?).

This has become a problem when trying to run any queries on the data, as even when trying to roll up a couple hours' worth of data we run into out-of-memory errors.

What I'd like to do is run a MapReduce job on the data and just load in the aggregates. I know this is supposed to be a fairly easy task, but I'm completely new to Hadoop, and I'm sorta stuck on one of the first steps.

  1. Set up EMR cluster (done)
  2. Load data into HDFS (I think this is what I'm supposed to do)

Currently all the data is being loaded into S3 as gzipped JSON files (which makes it easy to load into Redshift). Do I have to change the file format to get it into Hadoop? Each S3 file looks something like this:

{  
    "timestamp":"2015-06-10T11:54:34.345Z", 
    "key":"someguid", 
    "device": { "family" : "iOS", "versions" : "v8.4" } 
}
{
    "timestamp":"2015-06-11T15:56:44.385Z", 
    "key":"some second key", 
    "device": { "family" : "Android", "versions" : "v2.2" }
} 

Each JSON object is one record/row. (Notice the JSON objects come one after another; in the real files there is no whitespace and there are no commas separating the objects either.)

It's not a huge deal for me to change the format of these files to something that will just work, but I'm not sure what that format would be (plain CSV files? Can I still gzip them?).

So my questions are:

  1. Is it possible to work with these files as-is? If so, would I have fewer headaches if I just changed them anyway?
  2. After I get my files right and can import them, what is the easiest way to accomplish my goal of simply rolling up this data by hour and saving the files back to S3 so I can load them into Redshift? Ideally I'd like this job to run every hour, so my Redshift tables are updated hourly with the previous hour's data. What technologies should I be reading about to accomplish this? Hive? Impala? Pig? Again, just looking for the simple solution.

Solution

  • From the sample data it is clear that your data is in JSON format. You can use any of MapReduce, Pig, or Hive to read and process the records.

    Pig and Hive are simpler than raw MapReduce, as you don't need to write as much code.

    If you are planning to read the data with Hive, you can use a Hive JSON SerDe. More detail about the implementation is available in How do you make a HIVE table out of JSON data?
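    For instance, here is a minimal sketch of an external table over your existing gzipped JSON in S3, assuming the OpenX JSON SerDe (org.openx.data.jsonserde.JsonSerDe) discussed in that question; the jar path/version and bucket name are placeholders:

        -- Placeholder jar path/version: download the OpenX JSON SerDe
        -- and register it with the Hive session first.
        ADD JAR /home/hadoop/json-serde-1.3-jar-with-dependencies.jar;

        -- One column per top-level JSON key; the nested "device" object
        -- maps to a Hive STRUCT. LOCATION points at the existing files.
        CREATE EXTERNAL TABLE raw_events (
          `timestamp` STRING,
          `key`       STRING,
          device      STRUCT<family:STRING, versions:STRING>
        )
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        LOCATION 's3://your-bucket/events/';

    Note that Hadoop reads gzip-compressed text transparently, so you can leave the files gzipped as they are; the one caveat is that .gz files are not splittable, so each file is processed by a single mapper.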

    If you are planning to use Pig, you can use JsonLoader in your Pig LOAD statement, as in the sketch below. You can get more detail about JsonLoader at http://joshualande.com/read-write-json-apache-pig/
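    A minimal Pig sketch, assuming Elephant Bird's JsonLoader (the loader used in the linked article) and its dependency jars are available; the jar path/version and bucket name are placeholders:

        -- Placeholder path/version; the loader also needs its dependency jars.
        REGISTER '/home/hadoop/elephant-bird-pig-4.5.jar';

        -- Each record is loaded as a single map; '#' dereferences a key,
        -- and '-nestedLoad' turns nested JSON objects into nested maps.
        events = LOAD 's3://your-bucket/events/'
                 USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
                 AS (json:map[]);

        fields = FOREACH events GENERATE
                   SUBSTRING((chararray)json#'timestamp', 0, 13) AS event_hour,
                   (chararray)json#'device'#'family'             AS family;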

    You can also write your own custom UDF in Pig or Hive to read the JSON data.
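    Putting it together for your hourly rollup goal, a minimal Hive sketch along these lines could aggregate the previous hour from the raw_events table sketched above and write delimited text back to S3 for a Redshift COPY; the bucket, the hour literal, and the count-by-device-family aggregate are placeholders for whatever rollup you actually need:

        -- Placeholder output location; one partition per hour keeps each
        -- run from overwriting earlier hours.
        CREATE EXTERNAL TABLE hourly_rollup (
          family STRING,
          events BIGINT
        )
        PARTITIONED BY (event_hour STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 's3://your-bucket/rollups/';

        -- Roll up one hour of raw events; substr(ts, 1, 13) keeps
        -- '2015-06-10T11' from the ISO-8601 timestamp.
        INSERT OVERWRITE TABLE hourly_rollup
        PARTITION (event_hour = '2015-06-10T11')
        SELECT device.family,
               count(*) AS events
        FROM   raw_events
        WHERE  `timestamp` LIKE '2015-06-10T11%'
        GROUP  BY device.family;

    You could then run that script on a schedule (a cron job on the master node, or an hourly EMR step) and COPY each hour's output prefix into Redshift.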