
Best practices to reduce the cost of Delta Live Tables in Azure Databricks


I'm currently setting up Delta Live Tables in Azure Databricks to support real-time use cases. For a given data source, let's assume I have 10 tables. I've set up a single Delta Live Tables pipeline for these tables with a scheduling frequency of every 3 hours. However, this approach is proving to be quite costly, and I'm looking for guidance on best practices for optimizing my use of Delta Live Tables.

Here are a few additional details to consider:

• Data format: CSV
• Cluster size while running the DLT pipeline: fixed size, 4 workers and 1 driver
• Full load data volume: exceeds 250 million records
• Incremental load data volume: over 10 million records

Please suggest a few best practices I should follow to reduce the cost.


Solution

  • Absolutely, the points above are on the mark. Below are a few more practices that, in my experience, have helped reduce cost.

    Optimizing Clusters:

    • Create right-sized clusters with appropriate configurations; Databricks selects the runtime version for Delta Live Tables pipelines automatically, so the main levers are worker count and instance types (a minimal settings sketch follows this list).
    • Use the Ganglia metrics charts to tune driver and worker instance types for better CPU and memory utilization.
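
    As an illustration only, here is a minimal sketch of the "clusters" section of a pipeline's JSON settings (editable from the pipeline settings UI); the node types and worker count are placeholders, not recommendations:

        "clusters": [
          {
            "label": "default",
            "node_type_id": "Standard_DS3_v2",
            "driver_node_type_id": "Standard_DS3_v2",
            "num_workers": 4
          }
        ]

    If the workload is spiky, an "autoscale" block with "min_workers" and "max_workers" can replace the fixed "num_workers" so the pipeline scales down when demand drops.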

    Cost Efficiency and Usage:

    • Ensure clusters are fully utilized to avoid unnecessary costs.
    • Analyze Ganglia metrics to tune Delta Live Tables pipelines; when a cluster is under-utilized, consolidate more notebooks under the same pipeline so they share its capacity (see the sketch after this list).
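
    For example, rather than running a separate pipeline per notebook, several notebooks for the same source can be attached to one pipeline through the "libraries" section of the pipeline settings so they share a single cluster; the notebook paths below are placeholders:

        "libraries": [
          { "notebook": { "path": "/Repos/project/dlt/bronze_tables" } },
          { "notebook": { "path": "/Repos/project/dlt/silver_tables" } }
        ]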

    Mode Variations and Cost Control: Understand the differences between Development and Production modes. In Development mode the cluster keeps running for roughly 2 extra hours after an update completes (so you can iterate quickly), which adds cost; in Production mode the cluster terminates as soon as the update finishes.

    Adjust "Development mode" settings to manage costs by changing cluster shutdown delays in Pipeline settings. set pipelines.clusterShutdown.delay to 60s