arrays bash shell loops vcf-variant-call-format

Running an array on multiple files with different output files linux

I would like to breakdown 8 files (representing a chromosome each) that each have roughly 2e9 lines into roughly 5 chunks of 4e8 lines. These are VCF files (https://en.wikipedia.org/wiki/Variant_Call_Format) which have a header and then the genetic variants so I need to preserve the headers for each and reattach them to the chromosome specific header. I am doing this in linux on a HPC.

I have done this with a single file before using:

#grab the header
head -n 10000 my.vcf | grep "^#" >header
#grab the non header lines
grep -v "^#" my.vcf >variants
#split into chunks with 40000000 lines
split -l 40000000 variants
#reattach the header to each and clean up
for i in x*;do cat header $i >$i.vcf && rm -f $i;done
rm -f header variants

I could do this manually with all 8 chromosome but I work within a HPC with array capabilities and feel like this could be done better with a for loop however, the syntax is slightly confusing to me.

I have tried:

#filelist is a list of the 8 chromosome files i.e. chr001.vcf, chr002.vcf...chr0008.vcf 
for f in 'cat filelist.txt'; head -n 10000 my.vcf | grep "^#" >header; done

This puts everything into the same header. How would I put the outputs into unique per chromosome headers? Similarly how would this then work with splitting the variants and reattaching the headers to each chunk per chromosome?

The desired output would be:

chr001_chunk1.vcf
chr001_chunk2.vcf
chr001_chunk3.vcf
chr001_chunk4.vcf
chr001_chunk5.vcf
...
chr008_chunk5.vcf

with each vcf chunk having the header from their respective chromosome"parent".

Many thanks

Solution

#!/bin/bash

#
# scan the current directory for chr[0-9]*.vcf
# extract header lines (^#)
# extract variants (non-header lines) and split to 40m partial files
# combine header with each partial file
#

# for tuning
lines=40000000

vcf_list=(chr[0-9]*.vcf)
if [ ${#vcf_list} -eq 0 ]; then
    echo no .vcf files
    exit 1
fi

tmpv=variants
hdr=header

for chrfile in "${vcf_list[@]}"; do
    # isolate without . extn
    base=${chrfile%%.*}
    echo $chrfile

    # extract header lines
    head -1000 $chrfile | grep "^#" > $hdr

    # extract variants
    grep -v "^#" $chrfile > $tmpv

    #
    # split variants into files with max $lines;
    # output files are created with a filter to combine header data and
    # partial variant data in 1 pass, avoiding additional file I/O;
    # output files are named with a leading 'p' to support multiple
    # runs without filename collision
    #
    split -d -l $lines $tmpv p${base}_chunk --additional-suffix=.vcf \
        --filter="cat $hdr - > \$FILE; echo \"  \$FILE\""
done

rm -f $tmpv $hdr

exit 0