I have a Python script with a function that temporarily downloads files from a bucket, transforms them into an ndarray, and finishes by saving the result (final size ~10 GB) to another bucket.
I need to run this script ~200 times, so I created a shell script, run_reshape.sh, to parallelize the runs. It follows this layout:
#!/bin/sh
python3 reshape.py 'group_1'
python3 reshape.py 'group_2'
...
I have been trying to parallelize these runs using GNU Parallel in the following way:
parallel --jobs 6 --tmpdir scratch/tmp --cleanup < run_reshape.sh
After 2-3 successful runs of the .py script on different cores, I get the following error from GNU Parallel:
parallel: Error: Output is incomplete. Cannot append to buffer file in $TMPDIR. Is the disk full?
parallel: Error: Change $TMPDIR with --tmpdir or use --compress.
I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120 GB of available space on disk.
I have checked both .parallel/tmp/ and scratch/tmp/. scratch/tmp/ is empty and .parallel/tmp/ has a 6-byte file in it. Also, all variables within the Python script are local to a function that is called without assigning its return value. As an extra precaution, I also delete them and call gc.collect() at the end of reshape.py.
Any help with this is greatly appreciated!
Extra Info
In case it's helpful, here is the basic outline of reshape.py:
import gc
import sys

import numpy as np
from PIL import Image

# gcs_file_system and file_io are assumed to be set up earlier in the real
# script (e.g. gcsfs.GCSFileSystem() and tensorflow.python.lib.io.file_io).

# Define reshape function
def reshape_images(arg):
    x_len = 1000
    new_shape = np.empty((x_len, 2048, 2048), dtype=np.float16)
    new_shape[:] = np.nan
    for n in range(x_len):
        # Load each image from the bucket and copy it into the stack
        with gcs_file_system.open(arg + str([n]) + '.jpg') as file:
            im = Image.open(file)
            np_im = np.array(im, dtype=np.float16)
            new_shape[n] = np_im
            del im
            del np_im
    # Write the ~10 GB array back to the bucket
    save_string = f'{arg}.npy'
    np.save(file_io.FileIO(save_string, 'w'), new_shape)
    del new_shape

# Run reshape function
reshape_images(sys.argv[1])

# Clear memory of namespace variables
gc.collect()
I'm not sure how the disk could be full. When I check free -m after parallel throws the error, I have >120GB of available space on disk.
You need to do df scratch/tmp before GNU Parallel stops.
GNU Parallel opens temporary files in --tmpdir, removes them immediately, but keeps them open. This is done so that no files need to be cleaned up if GNU Parallel is killed.
You will most likely discover a situation where the disk is full while the jobs are running, but as soon as GNU Parallel ends, the space will be free again.
So if you only look at df after GNU Parallel has finished, you will not be looking at the time when the disk is full.
In other words: what you see is 100% normal behaviour when scratch/tmp is too small.
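To see the deleted-but-open behaviour for yourself, here is a minimal sketch (the file name scratch/tmp/demo and the 1 GB size are just placeholders; it assumes scratch/tmp is on the disk you want to watch):
exec 3> scratch/tmp/demo               # open a file for writing on fd 3
rm scratch/tmp/demo                    # the name is gone, but the open fd keeps the space allocated
dd if=/dev/zero bs=1M count=1024 >&3   # write ~1 GB into the unlinked file
df -h scratch/tmp                      # 'Used' still includes that 1 GB
exec 3>&-                              # close the fd; only now is the space released
df -h scratch/tmp
This is the same pattern GNU Parallel uses for its buffer files, which is why the directory looks empty even while the disk fills up.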
Try setting --tmpdir to a dir with more available space.
Or try:
seq 100000000 | parallel -uj1 -N0 df scratch/tmp
while running your jobs and see the disk fill up.
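If a bigger disk is available, something along these lines should work; /mnt/bigdisk/tmp is only a placeholder for a directory with enough free space, and --compress (suggested in the error message) shrinks the buffer files at the cost of some CPU:
parallel --jobs 6 --tmpdir /mnt/bigdisk/tmp --compress < run_reshape.sh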