I have several VCF files from different individuals that are gzip-compressed (.vcf.gz). I want to merge these files into one VCF file containing all individuals by using vcf-merge.
However, vcf-merge requires the files to be compressed with bgzip and indexed with tabix. Does anyone know if I can go from a .vcf.gz file to a bgzip-compressed, tabix-indexed file without uncompressing it first? Uncompressing takes a lot of storage because the files are really big.
If I understand correctly, you have file.vcf.gz, which is a gzip-compressed VCF file (not block-gzip compressed), and you would like file.vcf.bgz, which is a block-gzip compressed VCF file with the same contents as file.vcf.gz, and file.vcf.bgz.tbi, which is a tabix index for file.vcf.bgz. You would also like to do this conversion without uncompressing the data.
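As a side note, it may be worth confirming that the file really is plain gzip before converting anything. BGZF is itself valid gzip, and some tools (bcftools, for instance) write block-gzipped output under a plain .gz name. The sketch below inspects the first 16 bytes of a file for the BGZF header layout that bgzip writes. The function name is_bgzf is hypothetical, and the check assumes the "BC" subfield comes first in the gzip extra field, which is how bgzip lays it out.

```shell
#!/usr/bin/env bash
# Return 0 if the file starts with a BGZF block header, 1 otherwise.
# A BGZF block is a gzip member with the FEXTRA flag set whose extra
# field carries the two-byte subfield identifier "BC" at offsets 12-13.
is_bgzf() {
    local header re
    # Dump the first 16 bytes as one continuous lowercase hex string.
    header=$(head -c 16 "$1" | od -An -tx1 | tr -d ' \n')
    # 1f8b = gzip magic, 08 = deflate, 04 = FEXTRA;
    # then 8 bytes (MTIME, XFL, OS, XLEN) = 16 hex digits; then 42 43 = "BC".
    re='^1f8b0804.{16}4243'
    [[ $header =~ $re ]]
}
```

For example, `is_bgzf file.vcf.gz && echo "already block-gzipped"`. If htslib is installed, its htsfile command should report the compression type as well.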
Unfortunately, I'm not aware of any way to recompress gzip data in blocked form without decompressing it first. You can, however, avoid ever writing the uncompressed data to disk, and keep memory use constant, by streaming the data:
gzip --decompress --to-stdout file.vcf.gz \
| bgzip -@4 \
> file.vcf.bgz
tabix -p vcf file.vcf.bgz
The first line decompresses file.vcf.gz, writing the decompressed output to the standard output stream. The second line block-gzip compresses the standard input stream, writing the compressed data to the standard output stream. The -@4 tells bgzip to use four threads; you can increase this if your machine has more cores. The third line directs the block-gzip compressed output to a file called file.vcf.bgz. The final command builds the tabix index, file.vcf.bgz.tbi. The separate tabix step is needed because bgzip's own --index option writes a BGZF (.gzi) index rather than a tabix index.
On my MacBook Pro this process took one minute to re-compress a 214MB file.
NB: This will not delete file.vcf.gz; you'll need to delete that yourself if you no longer want it.
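Since the question involves several files, the same streaming approach can be wrapped in a small loop. This is a minimal sketch rather than a tested production script: the function names recompress_vcf and recompress_all are hypothetical, and it assumes bgzip and tabix (both from htslib) are on the PATH.

```shell
#!/usr/bin/env bash
# Make a pipeline fail if any stage fails (e.g. gzip hits a corrupt input).
set -o pipefail

# Re-compress one gzip-compressed VCF as BGZF and index it with tabix.
# For NAME.vcf.gz this writes NAME.vcf.bgz and NAME.vcf.bgz.tbi alongside it.
recompress_vcf() {
    local in="$1"
    local out="${in%.gz}.bgz"
    # Stream gzip -> bgzip so the uncompressed VCF is never written to disk.
    gzip -dc "$in" | bgzip -c -@ 4 > "$out" || return 1
    # Build the tabix index (bgzip alone does not produce one).
    tabix -p vcf "$out"
}

# Convert every file given as an argument, stopping at the first failure.
recompress_all() {
    local f
    for f in "$@"; do
        recompress_vcf "$f" || { echo "failed: $f" >&2; return 1; }
    done
}
```

A call such as `recompress_all sample1.vcf.gz sample2.vcf.gz` (file names here are placeholders) leaves the original files in place. If a downstream tool expects a .gz suffix, the .bgz files can be renamed together with their .tbi indexes, since BGZF files are valid gzip files.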