amazon-web-services, amazon-s3, etl, amazon-emr, aws-glue

What is the difference between AWS Glue ETL Job and AWS EMR?


If I had to perform ETL on a huge dataset (say 1 TB) stored in S3 as CSV files, I could use either an AWS Glue ETL job or AWS EMR steps. How is AWS Glue different from AWS EMR, and which is the better solution in this case?


Solution

  • Most of the differences have already been listed, so I'll focus on what is specific to the use case.

    When to choose AWS Glue

    1. The data is huge but structured, i.e. it is tabular and in a known format (CSV, Parquet, ORC, JSON).
    2. Lineage is required. If you need a data lineage graph while developing your ETL job, prefer building the job with Glue's native libraries.
    3. The developers don't need to tune performance parameters such as the number of executors or the memory per executor.
    4. You don't want the overhead of managing a large cluster, and you want to pay only for what you use.
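
    To illustrate point 3, here is a minimal sketch of how little has to be specified to start a Glue job run: a worker type and a worker count, rather than per-executor settings. It only builds the keyword arguments for boto3's `start_job_run` and makes no AWS call. The job name, the bucket paths, and the `--input_path` / `--output_path` argument names are hypothetical, and the worker defaults are arbitrary examples, not recommendations.

```python
def glue_job_run_request(job_name, input_path, output_path,
                         workers=10, worker_type="G.1X"):
    """Build keyword arguments for boto3's glue.start_job_run.

    Nothing is sent to AWS here; the function only assembles the request.
    """
    if workers < 1:
        raise ValueError("workers must be a positive integer")
    return {
        "JobName": job_name,
        # Capacity is expressed as a worker type and a count,
        # not as executor memory or cores.
        "WorkerType": worker_type,
        "NumberOfWorkers": workers,
        # Custom job arguments are passed as "--name": "value" pairs
        # (these two names are made up for this example).
        "Arguments": {
            "--input_path": input_path,
            "--output_path": output_path,
        },
    }


request = glue_job_run_request(
    "csv-to-parquet",                  # hypothetical job name
    "s3://example-bucket/raw/",        # hypothetical input location
    "s3://example-bucket/curated/",    # hypothetical output location
)
print(request)
```

    With credentials configured and a job of that name already defined, the request could be submitted with `boto3.client("glue").start_job_run(**request)`.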

    When to use EMR

    1. The data is huge but semi-structured or unstructured, so you can't get any benefit from the Glue catalog.
    2. You care only about the outputs, and lineage is not required.
    3. You need to set the memory per executor yourself, depending on the type of job and its requirements.
    4. You can manage the cluster easily, or you have many jobs that can run concurrently on the same cluster, which saves you money.
    5. For structured data, you should use EMR when you want more of the Hadoop ecosystem, such as Hive or Presto, for further analytics.
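
    Point 3 can be sketched as follows: on EMR, a Spark job is typically submitted as a step that runs `spark-submit` through `command-runner.jar`, so executor memory, cores, and count are set explicitly. This sketch only builds the step definition in the shape boto3's `add_job_flow_steps` expects and makes no AWS call. The step name, the script location, and the resource values are hypothetical and would need tuning for a real workload.

```python
def emr_spark_step(name, script_uri, executor_memory="8g",
                   executor_cores=4, num_executors=20):
    """Build an EMR step definition that runs a PySpark script.

    Nothing is sent to AWS here; the function only assembles the step.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                # Per-executor resources are chosen explicitly by you.
                "--executor-memory", executor_memory,
                "--executor-cores", str(executor_cores),
                "--num-executors", str(num_executors),
                script_uri,
            ],
        },
    }


step = emr_spark_step(
    "csv-etl",                                   # hypothetical step name
    "s3://example-bucket/scripts/etl_job.py",    # hypothetical script location
    executor_memory="16g",
)
print(step["HadoopJarStep"]["Args"])
```

    With credentials configured, the step could be added to a running cluster with `boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` is the ID of your own cluster.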

    So it depends on your use case. Both are great services.