Search code examples
apache-sparkpysparkdatabricksazure-databrickscomputation-theory

What is the industry standard for number of clusters for a development team in Databricks?


I am a part of a team of 5 developers that work with gathering data, transforming, analyzing and predicting data in Azure Databricks (basically a combination of Data Science and Data Engineering). Up until now we have been working on relatively small data, so the team of 5 could easily work with a single cluster with 8 worker nodes in development. Even though we are 5 developers, usually we're at maximum 3 developers in Databricks at the same time.

Recently we started working with "Big Data" and thus we need to make use of Databricks' Apache Spark parallelization methods to improve run-times for our codes. However, a problem that quickly came to light is that with more than one developer running parallelizing codes on a single cluster, there will be queues that slow us down. Because of this we have been thinking about increasing the amount of clusters in our dev-environment so that multiple developers can work on codes that take use of the Spark parallelizing methods.

My question is this: What is the industry standard for number of clusters to have in a development environment? Do teams usually have a cluster per developer? That sounds like it could easily become quite expensive in terms of economic costs.


Solution

  • Usually I see following pattern:

    • There is a shared cluster for many people for adhoc experimenting, "small" data processing, etc. Please notice that current versions of databricks runtimes is trying to split resources between all users.

    • If some people need to run something "heaviweight", like, integration tests, etc., closer to production workloads, it's allowed them to create clusters. But to control costs, etc. it's recommended to use cluster policies to limit a size of cluster to create, node types, auto-termination times, etc.

    • For development clusters it's ok to use spot instances because Databricks cluster manager will pull new instances if existing ones are evicted

    • SQL queries could be more efficient to run on SQL warehouses that are optimized for BI workloads

    P.S. Really, integration tests, and similar things could be easily run as jobs that are less expensive