hadoop amazon-web-services hbase storage emr

Use case HBase on EMR


I read the AWS documentation, but one point is still unclear.

Is S3 the primary storage of an EMR cluster, or does the data live on the EC2 instances, with S3 holding just a copy?

From the docs:

  • "HBase on Amazon EMR provides the ability to back up your HBase data directly to Amazon Simple Storage Service (Amazon S3)"

  • "Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input..."

  • "provides the ability to launch a new cluster and populate it with data from a previous HBase backup"

My use case: use HBase to store terabytes of data and update my tables only two or three times a month by starting an EMR cluster. The tables would be stored on S3.


Solution

  • The key question for your use case is how the data needs to be available between updates.

    If your goal is to have the data accessible through an HBase interface all the time, then an HBase cluster (such as one on EMR) needs to be up and running continuously. HBase currently supports only HDFS as live storage for HFiles. S3 is external to the cluster, so it can serve as a destination for backups or for other ingress/egress of data.
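
To make the two options concrete, here is a minimal sketch of one update cycle in each case, following the answer's premise that HFiles live on HDFS and S3 only holds backups. It makes no AWS calls: `plan_update_cycle` and the step descriptions are hypothetical names used for illustration, not part of any EMR or HBase API.

```python
from typing import List


def plan_update_cycle(needs_live_hbase_access: bool) -> List[str]:
    """Return the ordered steps for one update cycle (illustrative only).

    Assumption: HFiles are served from HDFS on the cluster, and S3 is used
    only as a backup/restore target, as described in the answer.
    """
    if needs_live_hbase_access:
        # Data must be queryable between updates, so the cluster never stops.
        return [
            "keep the cluster running; HFiles stay on HDFS",
            "apply the update to the live tables",
            "back up HBase data to S3",
        ]
    # Data is only needed during updates, so the cluster can be transient.
    return [
        "launch a new cluster",
        "restore HBase data from the previous S3 backup into HDFS",
        "apply the update",
        "back up HBase data to S3",
        "terminate the cluster",
    ]


if __name__ == "__main__":
    # The question's case: updates two or three times a month,
    # no stated need for access in between.
    for step in plan_update_cycle(needs_live_hbase_access=False):
        print(step)
```

With a transient cluster, the S3 backup is the only copy that survives between runs, so each cycle has to end with a backup before the cluster is terminated.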