
How to fix Pytorch RuntimeError: [enforce fail at inline_container.cc:588] . PytorchStreamWriter failed writing file data/17: file write failed


I've hit the error RuntimeError: [enforce fail at inline_container.cc:588] PytorchStreamWriter failed writing file data/17: file write failed in a training session that had been running fine for several hours. The full stack trace is:

Traceback (most recent call last):
  File "/opt/conda/bin/stylegan2_pytorch", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/cli.py", line 190, in main
    fire.Fire(train_from_folder)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/cli.py", line 184, in train_from_folder
    mp.spawn(run_training,
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 619, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 853, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:588] . PytorchStreamWriter failed writing file data/17: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/cli.py", line 60, in run_training
    retry_call(model.train, tries=3, exceptions=NanException)
  File "/opt/conda/lib/python3.10/site-packages/retry/api.py", line 101, in retry_call
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter, logger)
  File "/opt/conda/lib/python3.10/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1147, in train
    self.save(self.checkpoint_num)
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1368, in save
    torch.save(save_data, self.model_name(num))
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 618, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 466, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 79331008 vs 79330896

/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The last saved checkpoint file is a different size than the previous one, and if I delete it and restart, the error recurs. Any help appreciated.


Solution

  • This turns out to be PyTorch-speak for "disk full". In hindsight the message isn't so unclear, since the file write really did fail. I'm not sure whether the underlying reason for the failed write (e.g. ENOSPC) can be bubbled up into the error message, which would have made this much less hairy; I'll see if I can pry into it.
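Since the opaque PytorchStreamWriter failure boils down to a full disk, one mitigation is to check free space before each checkpoint write so the job fails with a clear message instead of a truncated zip file. This is a minimal sketch using the standard library; the helper name and the headroom threshold are my own, not part of stylegan2_pytorch or PyTorch:

```python
import os
import shutil

def check_free_space(path, min_free_bytes):
    """Return True if the filesystem holding `path` has at least
    `min_free_bytes` of free space (via shutil.disk_usage)."""
    directory = os.path.dirname(os.path.abspath(path))
    free = shutil.disk_usage(directory).free
    return free >= min_free_bytes

# Example guard around a checkpoint write (torch.save shown in a
# comment to keep this sketch dependency-free):
#
# if not check_free_space(ckpt_path, 2 * 1024**3):  # ~2 GiB headroom
#     raise RuntimeError("disk nearly full; refusing to write checkpoint")
# torch.save(save_data, ckpt_path)
```

A threshold of a couple of gigabytes (comfortably above the size of one checkpoint) gives the loop a chance to stop cleanly before torch.save produces a corrupt file.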