Tags: tensorflow, azure-machine-learning-service

Disk I/O extremely slow on P100-NC6s-V2


I am training an image segmentation model in an Azure ML pipeline. During the testing step, I save the model's output to the associated blob storage. I then want to compute the IoU (Intersection over Union) between the predicted output and the ground truth. Both sets of images live on blob storage. However, the IoU calculation is extremely slow, and I think it is disk-bound. In my IoU calculation code I am only loading the two images (all other code is commented out), and it still takes close to 6 seconds per iteration, while training and testing were fast enough.
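
Each iteration boils down to loading one pair of files and comparing them; a minimal sketch of what that looks like (the directory paths and the binary-mask assumption here are illustrative, not my exact code):

    import os
    import time

    import numpy as np
    from PIL import Image

    # Illustrative paths on the blobfuse-mounted storage
    PRED_DIR = "/mnt/blobstore/predictions"
    GT_DIR = "/mnt/blobstore/ground_truth"

    def iou(pred, gt):
        """Intersection over Union of two binary masks."""
        pred, gt = pred > 0, gt > 0
        union = np.logical_or(pred, gt).sum()
        return np.logical_and(pred, gt).sum() / union if union else 1.0

    for name in sorted(os.listdir(GT_DIR)):
        t0 = time.time()
        pred = np.array(Image.open(os.path.join(PRED_DIR, name)))  # read from blobfuse
        gt = np.array(Image.open(os.path.join(GT_DIR, name)))      # read from blobfuse
        print(f"{name}: IoU={iou(pred, gt):.3f} ({time.time() - t0:.1f}s)")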

Is this behavior normal? How do I debug this step?


Solution

  • A few notes on the drives that an AzureML remote run has available:

    Here is what I see when I run df on a remote run (in this one, I am using a blob Datastore via as_mount()):

    Filesystem                             1K-blocks     Used  Available Use% Mounted on
    overlay                                103080160 11530364   86290588  12% /
    tmpfs                                      65536        0      65536   0% /dev
    tmpfs                                    3568556        0    3568556   0% /sys/fs/cgroup
    /dev/sdb1                              103080160 11530364   86290588  12% /etc/hosts
    shm                                      2097152        0    2097152   0% /dev/shm
    //danielscstorageezoh...-620830f140ab 5368709120  3702848 5365006272   1% /mnt/batch/tasks/.../workspacefilestore
    blobfuse                               103080160 11530364   86290588  12% /mnt/batch/tasks/.../workspaceblobstore
    

    The interesting items are overlay, /dev/sdb1, //danielscstorageezoh...-620830f140ab and blobfuse:

    1. overlay and /dev/sdb1 are both backed by the local SSD on the machine (I am using a STANDARD_D2_V2, which has a 100 GB SSD).
    2. //danielscstorageezoh...-620830f140ab is the mount of the Azure File Share that contains the project files (your script, etc.). It is also the current working directory for your run.
    3. blobfuse is the blob store that I had requested to mount in the Estimator when I executed the run (see the sketch just after this list for how that is requested).
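
    For completeness, a mount like the blobfuse one is requested by passing the datastore's as_mount() reference into the Estimator. A minimal sketch with the azureml-sdk (the folder, entry script, and compute target names are placeholders):

        from azureml.core import Workspace, Datastore, Experiment
        from azureml.train.estimator import Estimator

        ws = Workspace.from_config()
        blob_ds = Datastore.get(ws, datastore_name="workspaceblobstore")

        # blob_ds.as_mount() is what makes the container show up as the
        # 'blobfuse' filesystem under /mnt/batch/tasks/... in the remote run
        est = Estimator(
            source_directory="./scripts",      # placeholder: folder with your code
            entry_script="eval_iou.py",        # placeholder: your entry script
            compute_target="gpu-cluster",      # placeholder: your compute target
            script_params={"--data_dir": blob_ds.as_mount()},
            pip_packages=["tensorflow-gpu", "pillow", "numpy"],
        )

        run = Experiment(ws, "iou-eval").submit(est)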

    I was curious about the performance differences between these 3 types of drives. My mini benchmark was to download and extract this file: http://download.tensorflow.org/example_images/flower_photos.tgz (it is a 220 MB tar file that contains about 3600 jpeg images of flowers).
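
    The gists linked at the end have the exact script; the core of the measurement is roughly this sketch (point target_dir at whichever drive you want to test):

        import os
        import tarfile
        import time
        import urllib.request

        URL = "http://download.tensorflow.org/example_images/flower_photos.tgz"

        def benchmark(target_dir):
            os.makedirs(target_dir, exist_ok=True)
            tgz = os.path.join(target_dir, "flower_photos.tgz")

            t0 = time.time()
            urllib.request.urlretrieve(URL, tgz)   # one large sequential write
            print(f"download_and_save: {time.time() - t0:.0f}s")

            t0 = time.time()
            with tarfile.open(tgz) as tar:
                tar.extractall(target_dir)         # ~3600 small file writes
            print(f"extract: {time.time() - t0:.0f}s")

        benchmark("/tmp/flowers")  # local SSD; rerun with the file share / blob mount paths to compare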

    Here are the results:

    Filesystem / Drive          Download and save   Extract
    Local SSD                                  2s        2s
    Azure File Share                           9s      386s
    Premium File Share                        10s      120s
    Blobfuse                                  10s      133s
    Blobfuse w/ Premium Blob                   8s      121s
    

    In summary, downloading the single 220 MB file takes roughly the same time everywhere; it is the thousands of small file writes during extraction that are an order of magnitude slower on the network drives. So it is highly recommended to use /tmp or Python's tempfile (i.e. the local SSD) if you are writing many smaller files.
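
    For example, a pattern along these lines keeps the many small writes on the local disk and only touches the network drive once at the end (the blob mount path and file names are placeholders):

        import shutil
        import tempfile
        from pathlib import Path

        blob_out = Path("/mnt/blobstore/segmentation_outputs/run_001")  # placeholder mount path

        with tempfile.TemporaryDirectory() as tmp:   # temp dir on the local disk
            for i in range(1000):
                # write the many small intermediate files locally first...
                (Path(tmp) / f"mask_{i:04d}.png").write_bytes(b"placeholder payload")

            # ...then copy the whole folder to the blob store in one pass
            shutil.copytree(tmp, blob_out)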

    For reference, here is the script I ran to measure: https://gist.github.com/danielsc/9f062da5e66421d48ac5ed84aabf8535

    And this is how I ran it: https://gist.github.com/danielsc/6273a43c9b1790d82216bdaea6e10e5c