Search code examples
amazon-ec2amazon-emr

Amazon EMR vs EC2 for Off loading BI & Analytics anno 2018


I looked at some posts but they are a bit older on this topic. I have read the AWS and other blogs as well, but ...

My simple non-programming question for AWS in today's environment is:

  • If we have a DWH of say, 20+TB and growing, that we want to off-load to the Cloud as many are doing, then

    • if we have a regular daily DWH feed with some mutations, then

      • should we in the case of AWS, use EMR or EC2?
    • Moreover, it is a complete batch environment, no Streaming or KAFKA requirements. Usage of SPARK for sure.

EMR seems great, but I have the impression it is for Data Scientists to do whatever they want whenever they want. For more regular ETL I am wondering if this is suited. The appeal of less management is certainly a boon.

In the docs on AWS I cannot find a definitive answer, hence this question.

My impression is that with AMI and bootstrapping own services, that EMR is certainly one way to go, and, that EC2 would be more for a KAFKA Cluster or if you really want to control your own environment and tooling completely based on say Cloudera's distribution per se.


Solution

  • So, the answer here is for others that may need to assess which options apply for off-loading, whatever. It is actually not so hard in hindsight. Note that AZURE and non-AWS vendors not considered here. In a nutshell, then:

    EMR is an (PaaS) AWS Managed Hadoop Service

    EMR provides tools that AMAZON feel will do the job for Data Science, Analytics, etc. But you can "bootstrap" your own requirements / software, if needed.

    EMR-clusters comprise short-running EC2 instances and provisioning happens under water as it were. You get patches effected easily this way. You can up- and downscale very easily as well. Compute and storage are divorced allowing this scaling to occur easily.

    Elasticity applies obviously more so to compute, data needs to be there as long as you need it. EMR relies on S3 to save results to, longer term. After saving, one terminates the EMR-cluster, and when required, start a new EMR-cluster and attach your saved S3 results - if applicable - to this new cluster. EMRFS allows S3 to look like part of HDFS and provides easy access. EBS-backed storaged exists that allows saving of results to storage tied to the EC2-instance for the duration of that instance.

    It's a new way of doing things. One has access to "spot" instances with obviously spot prices. Billing is less predictable as it depends what you do, but could well overall be cheaper - provided governed correctly. An example of this is expedia's management of EMR-clusters.

    Ad-hoc querying is not well served with S3, so you will need another AWS Managed Services such as Presto / Athena or Redshift (Spectrum) which is an additional set of services and cost. Just mentioning this due to slower S3 performance.

    EC2 (IaaS) is more "traditional"

    You elect to take this path if you want to provision EC2 instances yourself a syou want control of the software and what you want on your Hadoop environment.

    EC2 instances - VMs - have compute power, memory, EBS-backed temporal storage, and use EFS for file systems for HDFS or, say, KUDU, and S3. S3 access is not as easy to access as under EMRFS with EMR.

    You install and maintain the Hadoop software yourself and apply patches, etc. Management of Hadoop on these EC2 instances is of course less of a big deal with Cloudera and Cloudbreak.

    Billing is more predictable one could argue, on the basis of up-time of an EC2 instance, and billing applies continuously for any persisted storage.

    Important point, one can combine an EC2 approach for, say, DWH Loading on Hadoop - if "off-loading", and EMR Clusters for Data Science.

    MR Data Locality

    This not adhered to in both approaches unless bare metal options used, but then the elasticity - E - is harder for both parties, which allows cost savings.

    Data locality seems to be assumed by most, but actually it has gone with Cloud computing as expected, and seems quite OK in terms of performance for Data Science etc.

    For ad hoc querying AMAZON say they are not so sure on S3, and from experience, using EFS fof HDFS/PARQUET or KUDU works pretty quick, to say the least, from my experience at least.