I would like to breakdown 8 files (representing a chromosome each) that each have roughly 2e9 lines into roughly 5 chunks of 4e8 lines. These are VCF files (https://en.wikipedia.org/wiki/Variant_Call_Format) which have a header and then the genetic variants so I need to preserve the headers for each and reattach them to the chromosome specific header. I am doing this in linux on a HPC.
I have done this with a single file before using:
#grab the header
head -n 10000 my.vcf | grep "^#" >header
#grab the non header lines
grep -v "^#" my.vcf >variants
#split into chunks with 40000000 lines
split -l 40000000 variants
#reattach the header to each and clean up
for i in x*;do cat header $i >$i.vcf && rm -f $i;done
rm -f header variants
I could do this manually with all 8 chromosome but I work within a HPC with array capabilities and feel like this could be done better with a for loop however, the syntax is slightly confusing to me.
I have tried:
#filelist is a list of the 8 chromosome files i.e. chr001.vcf, chr002.vcf...chr0008.vcf
for f in 'cat filelist.txt'; head -n 10000 my.vcf | grep "^#" >header; done
This puts everything into the same header. How would I put the outputs into unique per chromosome headers? Similarly how would this then work with splitting the variants and reattaching the headers to each chunk per chromosome?
The desired output would be:
chr001_chunk1.vcf
chr001_chunk2.vcf
chr001_chunk3.vcf
chr001_chunk4.vcf
chr001_chunk5.vcf
...
chr008_chunk5.vcf
with each vcf chunk having the header from their respective chromosome"parent".
Many thanks
#!/bin/bash
#
# scan the current directory for chr[0-9]*.vcf
# extract header lines (^#)
# extract variants (non-header lines) and split to 40m partial files
# combine header with each partial file
#
# for tuning
lines=40000000
vcf_list=(chr[0-9]*.vcf)
if [ ${#vcf_list} -eq 0 ]; then
echo no .vcf files
exit 1
fi
tmpv=variants
hdr=header
for chrfile in "${vcf_list[@]}"; do
# isolate without . extn
base=${chrfile%%.*}
echo $chrfile
# extract header lines
head -1000 $chrfile | grep "^#" > $hdr
# extract variants
grep -v "^#" $chrfile > $tmpv
#
# split variants into files with max $lines;
# output files are created with a filter to combine header data and
# partial variant data in 1 pass, avoiding additional file I/O;
# output files are named with a leading 'p' to support multiple
# runs without filename collision
#
split -d -l $lines $tmpv p${base}_chunk --additional-suffix=.vcf \
--filter="cat $hdr - > \$FILE; echo \" \$FILE\""
done
rm -f $tmpv $hdr
exit 0