Tags: amazon-web-services, etl, data-warehouse, business-intelligence, aws-glue

Synchronize all data operations under the same tool


In my company, we have Pentaho Kettle running our ETL jobs daily (one machine is sufficient for all the data we have), which means:

  • reading data from different sources, mostly relational databases, spreadsheets, and APIs
  • applying transformations and inserting the data into Redshift
  • making API calls to external SaaS tools

We are planning to redo it with a more advanced tool, which would enable us to:

  • update the DWH more frequently than once a day
  • get and push data to the SaaS APIs we use more easily (processing and composing JSON in Pentaho is painful; see the sketch after this list)
  • trigger other workloads in a workflow (for example, Python scripts)
  • synchronize machine learning pipelines that are running on EC2 machines
  • be ready for 5x the data scale within a year (where one machine might not be enough)
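For illustration, composing and pushing a JSON payload from Python is roughly this simple (the SaaS endpoint, token variable, and field names below are made up, not a real service):

    import json
    import os

    import requests

    # Hypothetical SaaS endpoint; the auth token comes from a placeholder env var.
    API_URL = "https://api.example-saas.com/v1/contacts"
    API_TOKEN = os.environ["SAAS_API_TOKEN"]

    # Compose the JSON body as a plain Python dict.
    payload = {
        "email": "jane@example.com",
        "properties": {
            "lifetime_value": 1234.56,
            "last_order_date": "2018-01-15",
        },
    }

    # Serialize and POST it; raise if the API rejects the request.
    response = requests.post(
        API_URL,
        headers={
            "Authorization": "Bearer " + API_TOKEN,
            "Content-Type": "application/json",
        },
        data=json.dumps(payload),
        timeout=30,
    )
    response.raise_for_status()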

What comes to mind is Luigi or Airflow as a workflow manager, with the code-based ETL written in Python. Since our entire infrastructure is on AWS, I see that AWS Glue is now also appearing as an option (I do not know whether it is only for ETL, or whether it can also be used for the other processes we plan to include).
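To make that concrete, this is the kind of rough Airflow sketch I have in mind (assuming Airflow 1.x import paths; the connection ID, table, S3 path, and IAM role are made up):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.postgres_operator import PostgresOperator


    def extract_from_saas(**context):
        """Pull data from a SaaS API and stage it in S3 (placeholder)."""
        pass


    default_args = {
        "owner": "data-team",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        dag_id="hourly_dwh_load",
        default_args=default_args,
        start_date=datetime(2018, 1, 1),
        schedule_interval="@hourly",  # more frequent than the current daily run
    )

    extract = PythonOperator(
        task_id="extract_saas_data",
        python_callable=extract_from_saas,
        provide_context=True,
        dag=dag,
    )

    # Load the staged file into Redshift with COPY; connection ID and IAM role
    # are assumptions for this sketch.
    load = PostgresOperator(
        task_id="copy_to_redshift",
        postgres_conn_id="redshift_default",
        sql="""
            COPY analytics.saas_events
            FROM 's3://my-bucket/saas_events/{{ ds }}/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
            FORMAT AS JSON 'auto';
        """,
        dag=dag,
    )

    extract >> load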

Is there any other solution? Does anyone have experience using these tools (especially how they work with Redshift and S3, and possibly with triggering Spark/Kinesis workloads in the future)?

If yes, what libraries should we use, and where is a good place to start and learn?


Solution

  • Sorry, but it is almost impossible to answer questions like these. Each company and team is different. What works for us won't necessarily work for you.

    I can, however, provide some general advice:

    Play to your strengths. If your team is full of strong C# coders, don't choose Python. If you know SQL Server inside and out, then choose its ETL tool.

    Planning. This is the most important step. Make sure you have fully analysed and documented how the new ETL solution will work. Identifying and solving all the complex problems up front will lead to shorter development times and a tidier solution. A greater understanding of the nitty-gritty will also help you to evaluate the various tools and frameworks on offer. At the end of this process you should know:

    • How long the development will take.
    • What functionality you require from your ETL tool.
    • How the ETL will be organised/monitored/updated.
    • What the major milestones are.

    If you plan properly it won't matter what technologies you use.

    Prototype and test. Especially important if you are using a tool or framework for the first time. At the very least test major functionality before committing to an approach. A company I once worked for spent tens of thousands of pounds on an ETL solution. The day after it was installed we discovered it did not support our CRM tool. No workaround could be found and we were forced to buy a second ETL tool. Very expensive mistake.

    Don't aim at a moving target. Ideally, the new and old ETL systems should use exactly the same source data and populate exactly the same reporting tables. This dramatically simplifies testing: it allows you to dual run, and it allows you to fall back to the old solution should you need to. Save the new fancy stuff for release two.
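    As an illustration of what dual running buys you, a check along these lines can be run after each load to confirm the two systems agree (the table names, queries, and connection details are placeholders for your own environment):

        import psycopg2

        # Aggregates to compare between the old and new reporting tables.
        CHECKS = {
            "row_count": "SELECT COUNT(*) FROM {table}",
            "total_revenue": "SELECT SUM(revenue) FROM {table}",
        }
        TABLES = ("reporting.sales_old", "reporting.sales_new")

        conn = psycopg2.connect(
            host="my-cluster.example.redshift.amazonaws.com",
            port=5439,
            dbname="dwh",
            user="etl",
            password="change-me",
        )

        with conn, conn.cursor() as cur:
            for name, template in CHECKS.items():
                results = []
                for table in TABLES:
                    cur.execute(template.format(table=table))
                    results.append(cur.fetchone()[0])
                status = "OK" if results[0] == results[1] else "MISMATCH"
                print(name, dict(zip(TABLES, results)), status)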

    Code. Don't write any code until all the other steps have been completed (prototypes/tests aside). When you fully understand a problem the code (almost) writes itself.

    For context: I manage a 3-billion-record data warehouse for a large multinational. I have made every single mistake I am warning you against.