Search code examples
arraysbashshellloopsvcf-variant-call-format

Running an array on multiple files with different output files linux


I would like to breakdown 8 files (representing a chromosome each) that each have roughly 2e9 lines into roughly 5 chunks of 4e8 lines. These are VCF files (https://en.wikipedia.org/wiki/Variant_Call_Format) which have a header and then the genetic variants so I need to preserve the headers for each and reattach them to the chromosome specific header. I am doing this in linux on a HPC.

I have done this with a single file before using:

#grab the header
head -n 10000 my.vcf | grep "^#" >header
#grab the non header lines
grep -v "^#" my.vcf >variants
#split into chunks with 40000000 lines
split -l 40000000 variants
#reattach the header to each and clean up
for i in x*;do cat header $i >$i.vcf && rm -f $i;done
rm -f header variants

I could do this manually with all 8 chromosome but I work within a HPC with array capabilities and feel like this could be done better with a for loop however, the syntax is slightly confusing to me.

I have tried:

#filelist is a list of the 8 chromosome files i.e. chr001.vcf, chr002.vcf...chr0008.vcf 
for f in 'cat filelist.txt'; head -n 10000 my.vcf | grep "^#" >header; done

This puts everything into the same header. How would I put the outputs into unique per chromosome headers? Similarly how would this then work with splitting the variants and reattaching the headers to each chunk per chromosome?

The desired output would be:

chr001_chunk1.vcf
chr001_chunk2.vcf
chr001_chunk3.vcf
chr001_chunk4.vcf
chr001_chunk5.vcf
...
chr008_chunk5.vcf 

with each vcf chunk having the header from their respective chromosome"parent".

Many thanks


Solution

  • #!/bin/bash
    
    #
    # scan the current directory for chr[0-9]*.vcf
    # extract header lines (^#)
    # extract variants (non-header lines) and split to 40m partial files
    # combine header with each partial file
    #
    
    # for tuning
    lines=40000000
    
    vcf_list=(chr[0-9]*.vcf)
    if [ ${#vcf_list} -eq 0 ]; then
        echo no .vcf files
        exit 1
    fi
    
    tmpv=variants
    hdr=header
    
    for chrfile in "${vcf_list[@]}"; do
        # isolate without . extn
        base=${chrfile%%.*}
        echo $chrfile
    
        # extract header lines
        head -1000 $chrfile | grep "^#" > $hdr
    
        # extract variants
        grep -v "^#" $chrfile > $tmpv
    
        #
        # split variants into files with max $lines;
        # output files are created with a filter to combine header data and
        # partial variant data in 1 pass, avoiding additional file I/O;
        # output files are named with a leading 'p' to support multiple
        # runs without filename collision
        #
        split -d -l $lines $tmpv p${base}_chunk --additional-suffix=.vcf \
            --filter="cat $hdr - > \$FILE; echo \"  \$FILE\""
    done
    
    rm -f $tmpv $hdr
    
    exit 0