I have a Python script with a function that temporarily downloads files from a bucket, transforms them into an ndarray, and finishes by saving the result (final size ~10 GB) to another bucket.
I need to run this script ~200 times, so I created a shell script, run_reshape.sh, to parallelize the runs. It follows this layout:
#!/bin/sh
python3 reshape.py 'group_1'
python3 reshape.py 'group_2'
...
I have been trying to parallelize these runs using GNU Parallel in the following way:
parallel --jobs 6 --tmpdir scratch/tmp --cleanup < run_reshape.sh
After 2-3 successful runs of the .py script on different cores, I get the following error from GNU Parallel:
parallel: Error: Output is incomplete. Cannot append to buffer file in $TMPDIR. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120 GB of available space on disk.
I have checked both .parallel/tmp/ and scratch/tmp/. scratch/tmp/ is empty and .parallel/tmp/ has a 6-byte file in it. Also, all variables within the Python script are local to a function that is called without assigning its return value. As an extra precaution, I also delete them and call gc.collect() at the end of reshape.py.
Any help with this is greatly appreciated!
Extra Info
In case it's helpful, here is the basic outline of reshape.py:
import gc
import sys

import numpy as np
from PIL import Image

# gcs_file_system and file_io are assumed to be set up earlier in the real
# script (e.g. gcsfs.GCSFileSystem() and tensorflow.python.lib.io.file_io).

# Define reshape function
def reshape_images(arg):
    x_len = 1000
    new_shape = np.empty((x_len, 2048, 2048), dtype=np.float16)
    new_shape[:] = np.nan
    for n in range(x_len):
        # Load each image from the bucket and copy it into the stack
        with gcs_file_system.open(arg + str([n]) + '.jpg') as file:
            im = Image.open(file)
            np_im = np.array(im, dtype=np.float16)
            new_shape[n] = np_im
            del im
            del np_im
    # Write the ~10 GB array back to the bucket
    save_string = f'{arg}.npy'
    np.save(file_io.FileIO(save_string, 'w'), new_shape)
    del new_shape

# Run reshape function
reshape_images(sys.argv[1])

# Clear memory of namespace variables
gc.collect()
I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.
You need to do df scratch/tmp before GNU Parallel stops.
GNU Parallel opens temporary files in --tmpdir, removes them immediately, but keeps them open. This is done so that no files need to be cleaned up if GNU Parallel is killed.
You will most likely discover a situation where the disk is full while the jobs are running, but as soon as GNU Parallel ends, the space will be free again.
So if you only look at df after GNU Parallel has finished, you will not be looking at the time when the disk is full.
In other words: what you see is 100% normal behaviour when scratch/tmp is too small.
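To see the deleted-but-open behaviour for yourself, here is a minimal sketch (the file name scratch/tmp/demo and the 1 GB size are just placeholders; it assumes scratch/tmp is on the disk you want to watch):
exec 3> scratch/tmp/demo               # open a file for writing on fd 3
rm scratch/tmp/demo                    # the name is gone, but the open fd keeps the space allocated
dd if=/dev/zero bs=1M count=1024 >&3   # write ~1 GB into the unlinked file
df -h scratch/tmp                      # 'Used' still includes that 1 GB
exec 3>&-                              # close the fd; only now is the space released
df -h scratch/tmp
This is the same pattern GNU Parallel uses for its buffer files, which is why the directory looks empty even while the disk fills up.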
Try setting --tmpdir to a dir with more available space.
Or try:
seq 100000000 | parallel -uj1 -N0 df scratch/tmp
while running your jobs and see the disk fill up.
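If a bigger disk is available, something along these lines should work; /mnt/bigdisk/tmp is only a placeholder for a directory with enough free space, and --compress (suggested in the error message) shrinks the buffer files at the cost of some CPU:
parallel --jobs 6 --tmpdir /mnt/bigdisk/tmp --compress < run_reshape.sh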