python amazon-web-services azure databricks cost-management

How can I reduce the financial cost of working in Databricks?


I was wondering whether anyone has thoughts on best practices for working in Databricks. Developing within Databricks is costing a lot, so I would like to know where else it would be better to develop Python code. With collaborative work in mind, is there a setup similar to Databricks that is free or low-cost to use?

Any suggestions greatly appreciated!


Solution

  • The cost of Databricks is driven by the size of the clusters you are running (1 driver and 1 worker, or 1 driver and 32 workers?), the spec of the machines in the cluster (low RAM and CPU, or high RAM and CPU), and how long you leave them running (always on, or a short time to live, aka "Terminate after x minutes of inactivity"). I am also assuming you are not running the always-on High Concurrency cluster mode.

    Some general recommendations would be:

    • work with smaller datasets in dev, e.g. representative samples, which would enable you to...
    • work with smaller clusters in dev, e.g. small 2-node clusters instead of large 32-node clusters
    • set a short time to live, e.g. 15 mins

    Together, these would reduce your cost.
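As a rough illustration of the small-cluster and short time-to-live recommendations, here is a minimal sketch of a dev cluster definition written as a Python dict. The field names (`num_workers`, `autotermination_minutes`, `node_type_id`, `spark_version`) follow the Databricks Clusters API; the cluster name and the placeholder values are assumptions for illustration only, so substitute whatever runtime and node type your workspace offers.

```python
import json

# Hypothetical dev-cluster spec. Field names follow the Databricks Clusters API;
# the name and the placeholder values below are illustrative assumptions.
dev_cluster = {
    "cluster_name": "dev-small",              # hypothetical name
    "spark_version": "<runtime-version>",     # placeholder: pick one from your workspace
    "node_type_id": "<small-node-type>",      # placeholder: a low RAM/CPU machine type
    "num_workers": 1,                         # 1 worker + 1 driver = 2 nodes
    "autotermination_minutes": 15,            # terminate after 15 mins of inactivity
}

# Serialise to JSON, e.g. to paste into the cluster UI's JSON editor
# or to send as the body of a cluster-creation request.
payload = json.dumps(dev_cluster, indent=2)
print(payload)
```

Keeping the spec in code (or in version control) makes it easier for a team to share one small, auto-terminating dev configuration instead of each person creating their own long-running cluster.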

    Obviously there is a trade-off in assembling representative samples and making sure your outputs are still accurate and useful, but that's up to you.
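To get a feel for how cluster size and running time combine, here is a back-of-the-envelope sketch. The function name `estimate_cost` and the flat rate of 0.50 per node-hour are made-up assumptions, not Databricks pricing. Real bills depend on DBU rates plus the cloud provider's VM prices, and bigger machines cost more per hour, so treat the numbers as relative rather than absolute.

```python
def estimate_cost(num_workers, hours_running, rate_per_node_hour):
    """Rough cost: every node (workers plus one driver) is billed per hour it runs."""
    nodes = num_workers + 1  # add the driver
    return nodes * hours_running * rate_per_node_hour


# Hypothetical flat rate per node-hour; not a real Databricks or cloud price.
RATE = 0.50

# Large cluster left running all day vs. small cluster used ~3 hours
# and then auto-terminated.
always_on_large = estimate_cost(num_workers=32, hours_running=24, rate_per_node_hour=RATE)
short_lived_small = estimate_cost(num_workers=1, hours_running=3, rate_per_node_hour=RATE)

print(f"32 workers, always on: {always_on_large:.2f} per day")
print(f"1 worker, 3 hours:     {short_lived_small:.2f} per day")
```

Even with the same per-node rate assumed for both, the difference comes from the two things you control most easily in dev: node count and hours running.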