I've hit this error in a training session that had been running fine for several hours:

RuntimeError: [enforce fail at inline_container.cc:588] PytorchStreamWriter failed writing file data/17: file write failed

The full stack trace is:
Traceback (most recent call last):
  File "/opt/conda/bin/stylegan2_pytorch", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/cli.py", line 190, in main
    fire.Fire(train_from_folder)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/cli.py", line 184, in train_from_folder
    mp.spawn(run_training,
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 619, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 853, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:588] . PytorchStreamWriter failed writing file data/17: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/cli.py", line 60, in run_training
    retry_call(model.train, tries=3, exceptions=NanException)
  File "/opt/conda/lib/python3.10/site-packages/retry/api.py", line 101, in retry_call
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter, logger)
  File "/opt/conda/lib/python3.10/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1147, in train
    self.save(self.checkpoint_num)
  File "/opt/conda/lib/python3.10/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1368, in save
    torch.save(save_data, self.model_name(num))
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 618, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 466, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:424] . unexpected pos 79331008 vs 79330896
/opt/conda/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
The last saved checkpoint is a different size from the one before it, and if I delete it and restart, the error recurs. Any help appreciated.
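A rough way to confirm that the last checkpoint really is unreadable, rather than merely a different size: torch.save writes a zip archive by default, so a file that was cut off mid-write usually fails a zip integrity check before torch.load even gets involved. This is only a sketch, and the helper name is mine:

import zipfile

import torch

def checkpoint_is_readable(path):
    # A truncated checkpoint usually isn't a valid zip archive at all.
    if not zipfile.is_zipfile(path):
        return False
    try:
        # If the archive itself is intact, loading is the real test.
        torch.load(path, map_location="cpu")
        return True
    except Exception:
        # Any failure here means the file on disk isn't usable as a checkpoint.
        return False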
This turns out to be PyTorch-speak for a full disk. In hindsight I suppose it isn't so unclear. I'm not sure whether the underlying reason for the failed write could be bubbled up into the error message, which would have made this much less hairy; I'll see if I can pry into it.
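In the meantime, a crude guard around the save call helps: check free space on the target filesystem before writing, and if torch.save still fails, include the disk usage in the re-raised error so the real cause is visible. This is only a sketch, not how stylegan2_pytorch actually saves; the function name and the 2 GB threshold are mine.

import os
import shutil

import torch

def save_with_disk_check(obj, path, min_free_bytes=2 * 1024**3):
    target_dir = os.path.dirname(os.path.abspath(path))
    # Refuse to start writing if the target filesystem already looks nearly full.
    free = shutil.disk_usage(target_dir).free
    if free < min_free_bytes:
        raise RuntimeError(f"only {free / 1e9:.2f} GB free in {target_dir}, not saving checkpoint")
    try:
        torch.save(obj, path)
    except RuntimeError as exc:
        # Re-raise with the disk state that the PytorchStreamWriter message hides.
        free = shutil.disk_usage(target_dir).free
        raise RuntimeError(f"torch.save failed with {free / 1e9:.2f} GB free in {target_dir}") from exc

The up-front check isn't airtight (something else can fill the disk mid-save), which is why the except branch checks the free space again at the moment the write actually fails.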