azure-machine-learning-service

Out of disk space


When running an AML pipeline on AML compute, I get this kind of error:

I can try rebooting the cluster, but that may not fix the problem (if storage accumulates on the nodes, it should be cleaned up).

Session ID: 933fc468-7a22-425d-aa1b-94eba5784faa
{"error":{"code":"ServiceError","message":"Job preparation failed: [Errno 28] No space left on device","detailsUri":null,"target":null,"details":[],"innerError":null,"debugInfo":{"type":"OSError","message":"[Errno 28] No space left on device","stackTrace":" File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1-setup/job_prep.py\", line 126, in <module>\n invoke()\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1-setup/job_prep.py\", line 97, in invoke\n extract_project(project_dir, options.project_zip, options.snapshots)\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1-setup/job_prep.py\", line 60, in extract_project\n project_fetcher.fetch_project_snapshot(snapshot[\"Id\"], snapshot[\"PathStack\"])\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 72, in fetch_project_snapshot\n _download_tree(sas_tree, path_stack)\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 106, in _download_tree\n _download_tree(child, path_stack)\n File \"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 106, in _download_tree\n _download_tree(child, path_stack)\n File 
\"/mnt/batch/tasks/shared/LS_root/jobs/jj2/azureml/piperun-20190911_1568231788841835_1/mounts/workspacefilestore/azureml/PipeRun-20190911_1568231788841835_1/azureml-setup/project_fetcher.py\", line 98, in _download_tree\n fh.write(response.read())\n","innerException":null,"data":null,"errorResponse":null}},"correlation":null,"environment":null,"location":null,"time":"0001-01-01T00:00:00+00:00"}

I would expect the job to run as it should. In fact, I've checked on the node, and it does have plenty of available hard drive space:

root@4f57957ac829466a86bad4d4dc51fadd000001:~# df -kh
Filesystem      Size  Used Avail Use% Mounted on
udev             28G     0   28G   0% /dev
tmpfs           5.6G  9.0M  5.5G   1% /run
/dev/sda1       125G  2.8G  122G   3% /
tmpfs            28G     0   28G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            28G     0   28G   0% /sys/fs/cgroup
/dev/sdb1       335G  6.7G  311G   3% /mnt
tmpfs           5.6G     0  5.6G   0% /run/user/1002

Suggestions on what I should check?


Solution

  • It seems you've run into Azure file share constraints. Note that the error is raised while fetching the project snapshot onto the mounted workspace file store, not while writing to the node's local disks, which is why `df` shows plenty of free space. You can change your runs to use blob storage instead, which scales to a large number of jobs running in parallel:

    https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#accessing-source-code-during-training
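    As a rough sketch of what that change looks like with the Azure ML SDK v1 (hedged: `workspaceblobstore` is the default blob datastore created with every workspace, and `source_directory_data_store` is the `RunConfiguration` setting that controls where the project snapshot is staged; adapt names to your workspace):

    ```python
    # Sketch: point a run's project snapshot at blob storage instead of the
    # default Azure file share (assumes Azure ML SDK v1 is installed and a
    # workspace config.json is present).
    from azureml.core import Workspace, Datastore
    from azureml.core.runconfig import RunConfiguration

    ws = Workspace.from_config()

    # "workspaceblobstore" is the blob datastore provisioned with the workspace.
    blob_store = Datastore(ws, "workspaceblobstore")

    run_config = RunConfiguration()
    # Stage the source/project snapshot on blob storage, avoiding the
    # file-share limits that surface as "[Errno 28] No space left on device".
    run_config.source_directory_data_store = blob_store.name
    ```

    You would then pass this `run_config` to your pipeline steps (e.g. via `PythonScriptStep(..., runconfig=run_config)`) so each job preparation phase downloads the snapshot from blob storage rather than the constrained file share.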