bash gnu-parallel genome xdgutils

parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ]


I am trying to run 3DDNA on Alpine, CU's supercomputing cluster, to assemble a genome from long-read and short-read/contact data (PacBio HiFi and Arima Hi-C). 3DDNA uses GNU Parallel to parallelize several steps of the assembly process, and GNU Parallel appears to follow the XDG Base Directory Specification. I have had trouble running it because the $TMPDIR and $XDG_CACHE_HOME variables seem to be defined incorrectly. I have defined both in .bashrc and .bash_profile as follows:

export TMPDIR=/scratch/alpine/.colostate.edu/username/463/juicedir/tmp
export XDG_CACHE_HOME=/scratch/alpine/.colostate.edu/username/463/juicedir/cache

When I submit the job, it runs for ~25 seconds and I get this output:

###############
Starting iterating scaffolding with editing:
...starting round 0 of scaffolding:
:) -p flag was triggered. Running LIGer with GNU Parallel support parameter set to true.
:) -s flag was triggered, starting calculations with 15000 threshold starting contig/scaffold size
:) -q flag was triggered, starting calculations with 1 threshold mapping quality
...Using cprops file: 463_scaffolds.0.cprops
...Using merged_nodups file: 463_scaffolds.mnd.0.txt
...Scaffolding all scaffolds and contigs greater or equal to 15000 bp.
...Starting iteration # 1
parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ].
:) DONE!
...visualizing round 0 results:
:) -p flag was triggered. Running with GNU Parallel support parameter set to true.
:) -q flag was triggered, starting calculations for 1 threshold mapping quality
:) -i flag was triggered, building mapq without
:) -c flag was triggered, will remove temporary files after completion
...Remapping contact data from the original contig set to assembly
parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ].

This style of output continues; the program essentially runs with empty files that it creates, and the only error I can identify is

parallel: Error: $XDG_CACHE_HOME can only contain [-a-z0-9_+,.%:/= ].

I can't find a similar error reported elsewhere. Other background info is that originally I was getting

parallel: Error: $TMPDIR can only contain [-a-z0-9_+,.%:/= ].

so I went into each individual .sh file that the program calls and set the temporary directory with the --tmpdir flag in every GNU Parallel command.

The last thing I tried was to create $HOME/.cache as a symlink to my desired cache folder in scratch storage. That didn't work either.

Any ideas or experience greatly appreciated.


Solution

  • after an existential crisis and giving up for a month

    And that was excellent, because in the meantime version 20250122 has been released.

    Instead of stopping, GNU Parallel now gives a warning if the variables end in \r. It then removes \r and runs as expected:

    parallel: Warning: Removed ^M (<CR>) from end of TMPDIR
    parallel: Warning: Removed ^M (<CR>) from end of XDG_CACHE_HOME
    

    So try upgrading to 20250122.

    Then you might not have to do the cleanup of \r below.
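To see whether the installed release already has this behaviour, the version can be compared against 20250122. This is only a sketch: `is_new_enough` is a hypothetical helper, and it assumes that the first line of `parallel --version` ends in a plain YYYYMMDD number (for example "GNU parallel 20250122").

```shell
# Hypothetical helper: given the first line of `parallel --version`
# (assumed to look like "GNU parallel 20250122"), succeed if the
# release is 20250122 or newer.
is_new_enough() {
    ver=${1##* }                    # keep only the last space-separated word
    case $ver in
        ''|*[!0-9]*) return 2 ;;    # not a plain number: cannot decide
    esac
    [ "$ver" -ge 20250122 ]
}

if command -v parallel >/dev/null 2>&1; then
    # Tolerate failures here so an unexpected `parallel` binary does not abort the check.
    first_line=$(parallel --version 2>/dev/null | head -n 1) || true
    if is_new_enough "$first_line"; then
        echo "new enough: $first_line"
    else
        echo "older than 20250122 (or unrecognised): $first_line"
    fi
else
    echo "parallel not found in PATH"
fi
```

On a cluster, run this inside the same environment (modules, conda env, job script) that the assembly job uses, since that may pick up a different `parallel` than your login shell.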

    The problem arises because you have made your .sh-files in a program that uses \r\n (carriage return, aka ^M, followed by newline) instead of \n (newline) at the end of each line. This is typical for Microsoft Windows programs.
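A quick way to confirm this diagnosis is to look for carriage returns directly. The sketch below is an illustration, not part of 3DDNA or GNU Parallel: `has_cr` is a hypothetical helper, and the demo file and its path are made up so that no real startup file is touched.

```shell
# Hypothetical helper: succeed if the file contains at least one carriage return.
has_cr() {
    LC_ALL=C grep -q "$(printf '\r')" "$1"
}

# Throwaway file that mimics a script saved with Windows line endings.
demo=$(mktemp)
printf 'export XDG_CACHE_HOME=/scratch/demo/cache\r\n' > "$demo"

if has_cr "$demo"; then
    echo "CR found in $demo"
fi

# Source the file in a subshell and dump the resulting value byte by byte:
# the \r at the end of the line becomes the last character of the variable.
( . "$demo"; printf '%s' "$XDG_CACHE_HOME" | od -c )
```

The `od -c` dump makes the otherwise invisible `\r` at the end of the value visible, which is the character that falls outside the allowed set in the error message. The same `has_cr` check can be pointed at your real files, e.g. `has_cr ~/.bashrc`.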

    This:

    perl -i.bak -pe 's/\r//' myscript.sh
    

    removes \r from myscript.sh and keeps the original as myscript.sh.bak. Run it on myscript.sh before you upload the file to the cluster; the command itself should not be put into the scripts you upload.

    If you have many .sh files you can do the cleanup of all the .sh files by running this on your laptop:

    parallel -q perl -i.bak -pe 's/\r//' ::: *.sh
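If neither perl nor GNU Parallel is at hand on the machine where you edit the files, the same cleanup can be sketched with a plain loop. Note the swap: this uses `cp` and `tr` instead of the perl one-liner above, and `tr -d '\r'` deletes every carriage return in the file rather than only the first one on each line. `strip_cr` is a hypothetical helper, and the demo runs in a scratch directory with a made-up script.

```shell
# Hypothetical helper: strip every \r from a file, keeping the original as FILE.bak
# (same end result as perl -i.bak, but implemented with cp and tr).
strip_cr() {
    cp -p "$1" "$1.bak" &&
    tr -d '\r' < "$1.bak" > "$1"
}

# Demo in a scratch directory so no real scripts are modified.
workdir=$(mktemp -d)
printf '#!/bin/bash\r\nexport TMPDIR=/tmp/demo\r\n' > "$workdir/myscript.sh"

for f in "$workdir"/*.sh; do
    strip_cr "$f"
done

# Verify: count the .sh files that still contain a carriage return.
remaining=0
for f in "$workdir"/*.sh; do
    if LC_ALL=C grep -q "$(printf '\r')" "$f"; then
        remaining=$((remaining + 1))
    fi
done
echo "files still containing CR: $remaining"
```

Because the cleaned content is written back into the existing file, its permissions (including the executable bit) are left unchanged. Once you are satisfied with the result, the .bak copies can be deleted.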