In Azure, I enabled all diagnostic logs for my Databricks workspace. I looked at all the tables, especially `DatabricksClusters` and `Usage`, but I didn't find any entry that would help me create an alert when the maximum number of workers is reached.

I want to monitor Databricks to find out when I have to increase the upper worker limit/SKU.
There are a few approaches to this:

Use diagnostic logs with Log Analytics. The diagnostic logs include cluster events, from which you can use the `resize` and `resizeResult` events. The `resize` event is primarily emitted by DLT pipelines; for all other clusters you need to use the `resizeResult` event, which includes a `clusterWorkers` field with the number of workers allocated after the resize. The main problem with this approach is that this event doesn't include the `max_workers` value, so you would somehow need to join it with the `create` and `edit` events to obtain the maximum number of workers. That can be problematic if the cluster configuration was last changed a long time ago and the corresponding events are no longer retained in Log Analytics.
Recently Databricks started a public preview of so-called system tables, which contain the same information as the diagnostic logs (and more tables are coming) but keep it for a longer time, so it's easier to join events like `resizeResult` with cluster information. You can then use Databricks SQL Alerts to send notifications. A recent blog post about using system tables for notifications also contains reusable queries.
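To show the shape of the query you might put behind a Databricks SQL alert, here is a self-contained sketch using `sqlite3` stand-in tables. The table and column names are hypothetical; check the system tables documentation for the actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stand-ins for the real system tables; names and columns are illustrative.
CREATE TABLE cluster_events (cluster_id TEXT, action TEXT, cluster_workers INT);
CREATE TABLE clusters (cluster_id TEXT, max_workers INT);
INSERT INTO clusters VALUES ('c1', 8), ('c2', 4);
INSERT INTO cluster_events VALUES
  ('c1', 'resizeResult', 8),
  ('c2', 'resizeResult', 2);
""")

# Alert condition: any cluster whose resize allocated max_workers nodes.
rows = conn.execute("""
    SELECT e.cluster_id, e.cluster_workers, c.max_workers
    FROM cluster_events e
    JOIN clusters c USING (cluster_id)
    WHERE e.action = 'resizeResult'
      AND e.cluster_workers >= c.max_workers
""").fetchall()
print(rows)  # clusters currently at their worker limit
```

In Databricks SQL you would save a query like this and attach an alert that fires whenever it returns any rows.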
Set up the Overwatch project, which consolidates diagnostic logs, cluster logs, and some other information to provide better insights into what happens in the workspace and in individual clusters. Note that Overwatch is slowly being replaced by system tables.