Tags: database, azure, databricks, azure-data-lake-gen2, azure-cost-calculation

Best architecture for a small-to-medium data project in the Azure cloud


I am part of a small data project where the four data ingestion pipelines have been running for over six months and have gathered close to 90 MBs of data. There is scope for the data volume to increase, but it will not grow beyond 1 TB at any given time.

I was asked to design a reliable infrastructure to ingest, transform and push the data to PowerBI. I gave them the following:

Orchestration engine - ADF
Storage - Data Lake Gen2
Transformation - DBX
Model/view - SQL
Front-end - PowerBI
Medallion architecture to store the data

I recognize the need to cut costs wherever possible, so I recommended the minimum requirements for each resource in the architecture. Despite my best efforts, the client's tech lead is not on board with the proposal.

He thinks Databricks is overkill for the project's data volume and prefers to use stored procedures instead.

He also thinks he only needs the curated layer, not the landing and staging layers. I explained the benefits of the medallion structure to him, but he still doesn't want them; his reasoning is that the data is very simple and the transformations are simple too.

So my question is,

  1. Given that I have low-volume data on the Azure cloud, should I consider using Databricks?

Note: I am not against the medallion model. I know we keep the landing data for future use cases, staging will hold Parquet files for easy processing of large data volumes, and curated holds the business logic. Please point out anything else that I missed here.


Solution

  • I assume you meant 90 GBs, not MBs; I inferred this from the pipelines having run for months.

    When a cost-reduction goal is in place, always focus on compute costs rather than storage costs. Let's start with your ADF-versus-Databricks question. You are already paying extra for using ADF instead of writing your own code (low-code tools are more expensive, at least in theory). It's a common fallacy that ADF is cheaper than Databricks; you can usually make Databricks even cheaper than ADF by writing your own code in it, using the smallest cluster available, and micro-managing that cluster. ADF pipelines will end up being expensive once the data volume increases.

    It's hard to give a good answer to the "cheapest ETL tool in the cloud" question because all of these tools are rather expensive due to the compute involved. Databricks is not expensive because of the platform itself, but because of the cluster you use; if you use a very small cluster, costs should not be too high (see the sketch below). For the same amount of compute, ADF will cost you more, but it might be easier and faster to set up the transformation logic in it. Not long ago, all clusters behind ADF Data Flows were Databricks' own clusters.
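
    As a rough sketch of what "the smallest cluster available" can look like, here is a single-node, auto-terminating cluster spec posted to the Databricks Clusters API from Python. The workspace URL, token, VM size and runtime version are placeholders, not values from the question.

```python
import requests

# Minimal sketch of a small, self-terminating Databricks cluster, created
# via the Clusters REST API (api/2.0/clusters/create). All identifiers
# below are placeholders - pick the smallest VM your region offers and the
# runtime you actually target.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXX"  # placeholder personal access token

cluster_spec = {
    "cluster_name": "etl-tiny",
    "spark_version": "13.3.x-scala2.12",   # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",     # placeholder small VM size
    "num_workers": 0,                      # single-node: driver only
    "autotermination_minutes": 15,         # shut down when idle
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id
```

    For scheduled ETL, a job cluster that is created per run and terminated afterwards is usually cheaper still than keeping an all-purpose cluster around.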

    If you already have a SQL Server license, you can use SSIS and execute the packages from your ADF; this would be the cheapest but least efficient solution.

    Medallion is useful, for example, when you need data lineage, or when you need a separate backup of the data that you can restart from after a mistake. Silver tables are useful because you don't need to recreate them from Bronze every time, and some business processes feed directly from Silver instead of Gold. And please don't worry about storage costs; they are dirt cheap, so you can store all three layers without much accrued cost. Parquet in ADLS Gen2 is great, and it's the gold standard.
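
    To make the three layers concrete, here is a minimal PySpark sketch of Bronze → Silver → Gold on ADLS Gen2. The storage account, container, paths and columns are invented for illustration; only the pattern (raw kept as-is, cleaned Parquet in Silver, aggregates in Gold) is the point.

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative medallion layout on ADLS Gen2. The account name, container,
# paths and columns ("order_id", "amount", "order_date") are made up.
spark = SparkSession.builder.getOrCreate()

base = "abfss://lake@mystorageaccount.dfs.core.windows.net"

# Bronze / landing: raw files kept as-is so you can always replay from them.
bronze = spark.read.option("header", True).csv(f"{base}/bronze/orders/")

# Silver / staging: cleaned and typed, stored as Parquet for cheap re-processing.
silver = (
    bronze
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_date", F.to_date("order_date"))
    .dropDuplicates(["order_id"])
)
silver.write.mode("overwrite").parquet(f"{base}/silver/orders/")

# Gold / curated: business-level aggregates that the reporting layer reads.
gold = (
    silver
    .groupBy("order_date")
    .agg(F.sum("amount").alias("daily_revenue"),
         F.countDistinct("order_id").alias("order_count"))
)
gold.write.mode("overwrite").parquet(f"{base}/gold/daily_revenue/")
```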

    It's not a good idea to assume you will never reuse the data in those tables; on many occasions I have had to go back to raw and start everything over, so such assumptions do not always hold. The Gold layer, i.e. the aggregated data, also has the performance benefit of low latency for your endpoint, which in this case is PowerBI.
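
    Since the original proposal lists SQL as the model/view layer in front of PowerBI, one possible way to serve that Gold aggregate is to push it into an Azure SQL table over plain JDBC, as in the sketch below; the path, server, database, table and credentials are all placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Gold aggregate written in the previous sketch (placeholder path).
gold = spark.read.parquet(
    "abfss://lake@mystorageaccount.dfs.core.windows.net/gold/daily_revenue/"
)

# Placeholder Azure SQL connection details.
jdbc_url = (
    "jdbc:sqlserver://myserver.database.windows.net:1433;"
    "database=analytics;encrypt=true"
)

# Overwrite a serving table that PowerBI connects to.
(
    gold.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.daily_revenue")
    .option("user", "etl_user")      # placeholder; prefer a secret scope / Key Vault
    .option("password", "********")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .mode("overwrite")
    .save()
)
```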

    Nevertheless, it's always good to have a copy of the data available, because in the data world, when we mess up, sometimes there is no way to reverse it. Hope this was helpful.