azure, azure-machine-learning-service

How is Azure Machine Learning's average GpuUtilization metric computed?


How is the "GpuUtilization" metric computed for an Azure Machine Learning (AML) workspace? What are the inputs and what is the equation used to compute GpuUtilization?

The "metrics" tab in the AML web portal shows a chart of GpuUtilization over a specified time period, along with the average GpuUtilization for that period. However, for some of my organization's AML workspaces, the reported average does not appear to match the data shown in the chart.

For example, the following screenshot shows the GpuUtilization for July 1-31, with the average GpuUtilization reported as 54.06. This is clearly much higher than what is shown in the chart. When I download the data from the chart (Share -> Download to Excel), I compute the average GpuUtilization to be ~11% in Excel. Why is there such a discrepancy?

[Screenshot: GpuUtilization chart for July 1-31, with the average reported as 54.06]

I have found similar discrepancies for other AML workspaces as well. However, the average GpuUtilization appears to be more accurate for the August 1-25 time period than it is for July 1-31. I wish to better understand how AML computes the average GpuUtilization over a time period so we can accurately account for my organization's AML GPU usage on a per-workspace basis.


Solution

  • The 54.06 is likely the average over only the time when the GPU VM was allocated. When the VM is deallocated, Azure Monitor receives no data, and these missing values are interpolated as zeros on the chart.

    To get a better estimate of utilization, you could check when the VM was stopped and exclude those intervals from the average you compute.
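The difference between the two averages can be illustrated with a minimal sketch. The sample values below are made up for illustration, and `average_utilization` is a hypothetical helper, not part of any Azure SDK. It assumes the explanation above holds: intervals with no reported data are either skipped or counted as zero.

```python
from statistics import mean

def average_utilization(samples, fill_missing_with_zero):
    """Average a list of utilization samples (percent).

    `samples` holds one value per time interval; None marks an interval
    where no data was reported (assumed here to mean the VM was deallocated).
    """
    if fill_missing_with_zero:
        # Treat gaps as 0% utilization, as a zero-filled chart export would.
        values = [0.0 if s is None else s for s in samples]
    else:
        # Average only the intervals that actually reported data.
        values = [s for s in samples if s is not None]
    return mean(values) if values else None

# Hypothetical data: 10 intervals, only 2 of which reported a value.
samples = [None, None, None, 50.0, None, None, 60.0, None, None, None]

avg_reported_only = average_utilization(samples, fill_missing_with_zero=False)
avg_zero_filled = average_utilization(samples, fill_missing_with_zero=True)

print(avg_reported_only)  # 55.0
print(avg_zero_filled)    # 11.0
```

With mostly empty intervals, the two numbers diverge sharply, which would be consistent with a portal average far above one computed from zero-filled exported data.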