I am training an image segmentation model on an Azure ML pipeline. During the testing step, I save the model's output to the associated blob storage. I then want to compute the IOU (Intersection over Union) between the predicted output and the ground truth. Both sets of images live in the blob storage. However, the IOU calculation is extremely slow, and I think it is disk bound. In my IOU calculation code I am only loading the two images (the rest of the code is commented out), yet it still takes close to 6 seconds per iteration, while training and testing were fast enough.
Is this behavior normal? How do I debug this step?
A few notes on the drives that an AzureML remote run has available:
Here is what I see when I run df on a remote run (in this one, I am using a blob Datastore via as_mount()):
Filesystem                             1K-blocks      Used  Available Use% Mounted on
overlay                                103080160  11530364   86290588  12% /
tmpfs                                      65536         0      65536   0% /dev
tmpfs                                    3568556         0    3568556   0% /sys/fs/cgroup
/dev/sdb1                              103080160  11530364   86290588  12% /etc/hosts
shm                                      2097152         0    2097152   0% /dev/shm
//danielscstorageezoh...-620830f140ab 5368709120   3702848 5365006272   1% /mnt/batch/tasks/.../workspacefilestore
blobfuse                               103080160  11530364   86290588  12% /mnt/batch/tasks/.../workspaceblobstore
The interesting items are overlay, /dev/sdb1, //danielscstorageezoh...-620830f140ab and blobfuse:
overlay and /dev/sdb1 are both the mount of the local SSD on the machine (I am using a STANDARD_D2_V2, which has a 100GB SSD).
//danielscstorageezoh...-620830f140ab is the mount of the Azure File Share that contains the project files (your script, etc.). It is also the current working directory for your run.
blobfuse is the blob store that I had requested to mount in the Estimator as I executed the run.
I was curious about the performance differences between these 3 types of drives. My mini benchmark was to download and extract this file: http://download.tensorflow.org/example_images/flower_photos.tgz (it is a 220 MB tar file that contains about 3600 jpeg images of flowers).
Here are the results:
Filesystem/Drive          Download_and_save  Extract
Local_SSD                 2s                 2s
Azure File Share          9s                 386s
Premium File Share        10s                120s
Blobfuse                  10s                133s
Blobfuse w/ Premium Blob  8s                 121s
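If you want a rough version of this comparison on your own mounts, a small-file write timer along the following lines may help. This is only a sketch and not the script linked at the end of this answer: the function name time_small_file_writes and its count and size parameters are made up for illustration, and the mount paths you pass in are whatever your run reports in df.

```python
import os
import time


def time_small_file_writes(directory, count=200, size=64 * 1024):
    """Return the seconds taken to write `count` files of `size` bytes each.

    `directory` is any writable path, for example a local temp directory
    or a path on a mounted file share or blobfuse mount.
    """
    payload = os.urandom(size)
    os.makedirs(directory, exist_ok=True)
    start = time.perf_counter()
    for i in range(count):
        # One open/write/close per file, which is the pattern that tends to
        # be slow on network-backed mounts.
        with open(os.path.join(directory, f"bench_{i:05d}.bin"), "wb") as f:
            f.write(payload)
    return time.perf_counter() - start
```

Calling it once with a directory under /tmp and once with a directory on the mounted datastore should show whether the mount is the bottleneck. Remember to delete the bench_*.bin files afterwards.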
In summary, writing small files is much, much slower on the network drives, so it is highly recommended to use /tmp or Python tempfile if you are writing smaller files.
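One way to apply this is to write the small files to a local temporary directory first and then copy a single archive to the mounted store. The sketch below assumes the outputs are available as a mapping of file name to bytes; the helper name write_small_files_locally, that input shape, and the destination path are all hypothetical and only there to illustrate the idea.

```python
import shutil
import tempfile
from pathlib import Path


def write_small_files_locally(files, dest_dir):
    """Stage many small files on local disk, then copy one archive to dest_dir.

    files: mapping of relative file name -> bytes (hypothetical input shape).
    dest_dir: directory on the network mount (hypothetical path).
    Returns the path of the copied .tar.gz archive.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "staging"
        staging.mkdir()
        for name, payload in files.items():
            target = staging / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(payload)
        # A single large sequential write replaces many small writes
        # against the network drive.
        archive = shutil.make_archive(
            str(Path(tmp) / "outputs"), "gztar", root_dir=str(staging)
        )
        return Path(shutil.copy(archive, dest_dir))
```

The same staging idea can be tried in the other direction: copy or extract the input images to local disk once, then read them from there in the loop.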
For reference, here is the script I ran to measure: https://gist.github.com/danielsc/9f062da5e66421d48ac5ed84aabf8535
And this is how I ran it: https://gist.github.com/danielsc/6273a43c9b1790d82216bdaea6e10e5c