bash, loops, for-loop, split, vcf-variant-call-format

Splitting multiple input files into multiple outputs using the split command in Linux


I have 8 files, each of which I would like to split into 5 chunks. I would normally do this one file at a time, but I would like to run it as a loop. I work on an HPC cluster.

I have created a list of the file names and saved it as "variantlist.txt". My code is:

for f in `cat variantlist.txt`; do split ${f} -n 5 -d; done

However, I end up with only 5 chunks, all of them from the final entry in variantlist.txt.

Even if I list the files individually:

for f in chr001.vcf chr002 ...chr008.vcf ; do split ${f} -n 5 -d; done

It still only splits the final file into 5 chunks.

Not sure where I am going wrong here. The desired output would be 40 chunks, 5 per chromosome. Your help would be greatly appreciated.

Many thanks


Solution

  • Each call to split writes the same set of output names (x00 through x04 by default), so every run overwrites the files from the previous one. Here's one way to handle that:

    for f in $(<variantlist.txt)      # don't use cat; assumes names without whitespace
    do  mkdir -p "$f.split"           # make a subdir for the files
        ( cd "$f.split" &&            # change into the subdir only in a subshell
          split "../$f" -n 5 -d       # split from there (assumes $f is a bare file name)
        )                             # close the subshell, parent still in base dir
    done
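
To see the overwriting for yourself, here is a minimal sketch. It assumes GNU coreutils split and uses two placeholder files, a.txt and b.txt, created in a temporary directory; neither name comes from the question.

```shell
#!/usr/bin/env bash
# Placeholder inputs in a scratch directory (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100   > a.txt
seq 101 200 > b.txt

for f in a.txt b.txt; do
    split -n 5 -d "$f"        # no PREFIX given, so the output names are always x00..x04
done

chunks=(x??)                  # every chunk file left on disk
echo "chunks on disk: ${#chunks[@]}"

# Rejoin the chunks in order and compare them with the last input file.
if cat x?? | cmp -s - b.txt; then
    echo "the chunks contain b.txt only; a.txt's chunks were overwritten"
fi
```

Two inputs are split, yet only one set of five chunk files survives, which matches what the question describes.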
    

    Or you could just do this -

    while IFS= read -r f         # grab each filename, exactly as written
    do split "$f" -n 5 -d        # split it
       for x in x??              # for each split file (x00 .. x04)
       do mv "$x" "$f.$x"        # rename it to include the parent file name
       done
    done < variantlist.txt       # take names from this file
    

    This is a lot slower, but doesn't use subdirs.

    My favorite, though -

    xargs -I {} split {} -n 5 -d {} < variantlist.txt
    

    The last argument becomes the PREFIX for split in place of the default x, so each input's chunks carry its own name (chr001.vcf00 through chr001.vcf04, for example).
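
A small sketch of the PREFIX behaviour, assuming GNU split and xargs. The file contents are stand-in numbers rather than real VCF data, and the trailing '.' in the prefix is my own addition for readability, not part of the command above.

```shell
#!/usr/bin/env bash
# Scratch directory with stand-in inputs (not real VCF content).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 100   > chr001.vcf
seq 101 200 > chr002.vcf
printf '%s\n' chr001.vcf chr002.vcf > variantlist.txt

# Same xargs form, with '.' appended so the chunk number is set off from the name.
xargs -I {} split -n 5 -d {} {}. < variantlist.txt

parts=(chr00?.vcf.??)         # all chunk files, across both inputs
printf '%s\n' "${parts[@]}"
echo "total chunks: ${#parts[@]}"
```

Because every input gets a distinct prefix, nothing is overwritten and no renaming step is needed afterwards.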

    EDIT -- with 2 billion lines per file, use this one:

    for f in $(<variantlist.txt)
    do split "$f" -d -n 5 "$f" & # run all in background at the same time
    done
    wait                         # don't continue until every split has finished
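
One caveat worth checking: a plain `-n 5` divides each file by size, so a chunk boundary can land in the middle of a line. GNU split also accepts `-n l/5`, which produces 5 chunks without cutting lines. The sketch below assumes GNU coreutils and uses stand-in data; it also verifies that the chunks rejoin to the original.

```shell
#!/usr/bin/env bash
# Scratch directory with stand-in inputs (not real VCF content).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1000    > chr001.vcf
seq 1001 2000 > chr002.vcf
printf '%s\n' chr001.vcf chr002.vcf > variantlist.txt

while IFS= read -r f; do
    # l/5: five chunks, each ending on a line boundary (GNU split)
    split -d -n l/5 "$f" "$f." &
done < variantlist.txt
wait                          # block until every background split has finished

# Sanity check: each file's chunks, concatenated in order, must equal the original.
status=ok
while IFS= read -r f; do
    cat "$f".0? | cmp -s - "$f" || status=mismatch
done < variantlist.txt
echo "round-trip check: $status"
```

Whether line-aligned chunks matter depends on what consumes them downstream; if the chunks are only ever concatenated back together, the size-based `-n 5` is fine.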