hadoop, apache-spark, amazon-s3, emr, amazon-glacier

How can I couple Amazon Glacier / S3 with Hadoop MapReduce / Spark?


I need to process data stored in Amazon S3 and Amazon Glacier with Hadoop / EMR and save the output data in an RDBMS, e.g. Vertica.

I am a total noob in big data. I have only gone through a few online sessions and slide decks about MapReduce and Spark, and written a few dummy MapReduce jobs for learning purposes.

So far I only have commands that import data from S3 to HDFS on Amazon EMR and, after processing, store the results in HDFS files.

So here are my questions:

  • Is it really mandatory to sync data from S3 to HDFS before running MapReduce, or is there a way to use S3 directly?

  • How can I make Hadoop access Amazon Glacier data?

  • And finally, how can I store the output in a database?

Any suggestion / reference is welcome.


Solution

  • EMR clusters can read from and write to S3 directly, so there is no need to copy the data onto the cluster first. S3 has a Hadoop FileSystem implementation, so it can mostly be treated the same as HDFS (see the PySpark sketch below).

  • AFAIK your MR/Spark jobs cannot access Glacier data directly; it first has to be restored/downloaded from Glacier, which is by itself a lengthy procedure (see the restore sketch below).

  • Check out Sqoop for pumping data between HDFS and a database; if your processing ends in Spark, writing over JDBC is another option (also sketched below).
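
For the S3 part, here is a minimal PySpark sketch that reads input straight from S3 and writes results back to S3. The bucket and path names are placeholders; on EMR, s3:// paths are served by EMRFS, so no prior copy to HDFS is needed.

    from pyspark.sql import SparkSession

    # On EMR, s3:// paths behave like any other Hadoop FileSystem,
    # so Spark can read/write S3 much the same way as HDFS.
    spark = SparkSession.builder.appName("s3-direct-example").getOrCreate()

    # Hypothetical bucket/prefixes -- replace with your own.
    lines = spark.sparkContext.textFile("s3://my-bucket/input/")

    # Trivial word count, just to show processing S3 input directly.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.saveAsTextFile("s3://my-bucket/output/wordcount/")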
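For the Glacier part, here is a sketch of a restore request using boto3, assuming the archived data sits in S3 under the GLACIER storage class (a standalone Glacier vault uses a different API). The bucket, key and retention days are placeholders; the restore itself can take hours, and only after it completes can your EMR job read the object.

    import boto3

    # Hypothetical bucket/key -- objects in the GLACIER storage class must be
    # restored back into S3 before Hadoop/Spark can read them.
    s3 = boto3.client("s3")

    s3.restore_object(
        Bucket="my-bucket",
        Key="archive/data-2015.csv",
        RestoreRequest={
            "Days": 7,  # how long the restored copy stays available in S3
            "GlacierJobParameters": {"Tier": "Standard"},  # standard retrieval takes hours
        },
    )

    # Check restore status; run the EMR job only once the restore has finished.
    head = s3.head_object(Bucket="my-bucket", Key="archive/data-2015.csv")
    print(head.get("Restore"))  # e.g. 'ongoing-request="true"' while still in flight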
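For the database part, Sqoop exports data from HDFS into an RDBMS. If you process with Spark, an alternative to Sqoop is writing the result over JDBC from the same job; a sketch assuming the Vertica (or other vendor) JDBC driver jar is available to Spark, with placeholder URL, table and credentials:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-export-example").getOrCreate()

    # Hypothetical: read the processed output back from S3
    # (or reuse the DataFrame produced earlier in the same job).
    result = spark.read.csv("s3://my-bucket/output/wordcount/", sep="\t")

    # Placeholder connection details -- the JDBC driver jar must be passed to
    # Spark, e.g. via --jars on spark-submit.
    (result.write
           .format("jdbc")
           .option("url", "jdbc:vertica://vertica-host:5433/mydb")
           .option("dbtable", "public.word_counts")
           .option("user", "dbadmin")
           .option("password", "secret")
           .mode("append")
           .save())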