Tags: split, parallel-processing, fasta, gnu-parallel, fastq

Split Fasta file with parallel


I have a long Fasta file (from a processed Fastq file) which I need to split into smaller files, at a maximum size of 700 MB each. I'm trying to do that using parallel, but I'm not sure how. My goal is to save each file as {n}.faa, with sequential numbers for n (01.faa, 02.faa, 03.faa, 04.faa ... 10.faa, 11.faa, 12.faa ... n.faa), each of approximately the same size. As far as I've read about parallel, I'm thinking this is what I should use:

parallel -a MyFile.fasta --block 700M --pipe-part --recend '\n' --recstart '>' {} > #.faa

I have two questions here, though. First, how do I make it so each file has approximately the same size, so that for an initial file of 1500 MB I won't end up with two 700 MB files and one 100 MB file, but three 500 MB files? One idea I had for this was to calculate the number of files needed (original file size / 700 MB, rounded up), then divide the original size by that number of files; since the number has been rounded up to an integer, the result is a block size under 700 MB. So I'd be doing something like this:

original_size=$(wc -c < MyFile.fasta)        # size in bytes (no filename in the output)
number=$(( original_size / 734003200 + 1 ))  # 700 MB = 734003200 bytes; round up
size=$(( original_size / number ))           # even block size, under 700 MB

With this I can get the size (in bytes) each file should have, so that all files are roughly the same size and all under 700 MB. Is there a simpler way to do this using only the parallel tool?
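As a worked example of that arithmetic (with a hypothetical 1500 MB input, since the real size would come from wc -c), using ceiling division so that an exact multiple of 700 MB does not get an extra file:

```shell
# Worked example of the rounding logic with a hypothetical 1500 MB file.
original_size=1572864000                                   # 1500 MB in bytes (example value)
max_block=734003200                                        # 700 MB in bytes
number=$(( (original_size + max_block - 1) / max_block ))  # ceiling division: 3 files needed
size=$(( original_size / number ))                         # 524288000 bytes = exactly 500 MB
echo "$number files of $size bytes"
```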

And what should I do to have each block saved to a different file? Is the "{} > #.faa" approach correct, or would using --cat be more efficient? Also, how do I get the number "#" to always be two digits (01...10)?

Also, if it must split the file into 12 smaller files but my computer only has 8 cores, will it use the 8 cores for 8 files and run the remaining 4 as jobs finish, or should I somehow limit the number of jobs to the number of cores?


Solution

  • This will split into files that are around 700 MB (± the size of a single record). They will be named 1..n:

    parallel -a MyFile.fasta --block 700M --pipe-part --recend '\n' --recstart '>' "cat >{#}"
    

    You cannot easily ask GNU Parallel to make 1500 MB into 3x500 MB instead of 2x700 MB + 1x100 MB.
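    One workaround is to do the even-block-size arithmetic from the question yourself and hand the result to --block (a sketch, not tested against a real 1500 MB file; MyFile.fasta is the input from the question):

```shell
# Sketch: pre-compute an even block size, then let GNU Parallel split on it.
original_size=$(wc -c < MyFile.fasta)                      # size in bytes
number=$(( (original_size + 734003200 - 1) / 734003200 ))  # files needed (700 MB ceiling)
parallel -a MyFile.fasta --block "$(( original_size / number ))" \
  --pipe-part --recend '\n' --recstart '>' "cat >{#}"
```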

    But you can make GNU Parallel split the file into a given number of files per jobslot. One file per jobslot:

    parallel -a MyFile.fasta --block -1 --pipe-part --recend '\n' --recstart '>' "cat >{#}"
    

    Two files per jobslot:

    parallel -a MyFile.fasta --block -2 --pipe-part --recend '\n' --recstart '>' "cat >{#}"
    

    Number of jobslots is given with -j and defaults to the number of CPU cores.

    To get the names to be 01..n you have to pad them with zeros. GNU Parallel does not offer that out of the box, but the man page's section on --rpl gives an example of how to define a replacement string that does just that:

    --rpl '{0#} $f=1+int("".(log(total_jobs())/log(10)));  $_=sprintf("%0${f}d",seq())'
    

    So:

    parallel -a MyFile.fasta --block 700M --pipe-part --recend '\n' --recstart '>' \
      --rpl '{0#} $f=1+int("".(log(total_jobs())/log(10)));  $_=sprintf("%0${f}d",seq())' \
      "cat >{0#}"
    

    If you use {0#} often, you can simply put the --rpl definition in ~/.parallel/config.
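    For reference, the line you would drop into ~/.parallel/config is exactly the --rpl definition; GNU Parallel reads that file and applies its options to every invocation:

```shell
# Contents of ~/.parallel/config: each line holds options applied to every run
--rpl '{0#} $f=1+int("".(log(total_jobs())/log(10)));  $_=sprintf("%0${f}d",seq())'
```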