Can I use tabix on vcf.gz files?

I have several VCF files from different individuals that are gzip-compressed (.vcf.gz). I want to merge these files into one VCF file containing all individuals by using vcf-merge.

However, this requires the files to be compressed with bgzip and indexed with tabix. Does anyone know if I can go from a .vcf.gz file to a bgzip-compressed, tabix-indexed file without uncompressing it first? Uncompressing takes a lot of storage because the files are really big.

Solution

If I understand correctly, you have:

file.vcf.gz which is a gzip compressed VCF file (not block-gzip compressed)

and you would like:

file.vcf.bgz which is a block-gzip compressed VCF file with the same contents as file.vcf.gz, and
file.vcf.bgz.tbi which is a tabix index for file.vcf.bgz

and you would like to do this conversion without uncompressing it.

Unfortunately, I'm not aware of any way to avoid uncompressing the data in order to recompress it in blocked form. However, you can avoid ever writing the uncompressed data to disk, and keep your memory costs constant, by streaming the data:

gzip --decompress --to-stdout file.vcf.gz \
  | bgzip -@4 \
  > file.vcf.bgz
tabix -p vcf file.vcf.bgz

The first line decompresses file.vcf.gz, writing the decompressed output to the standard output stream. The second line block-gzip compresses the standard input stream, writing the compressed data to the standard output stream. The -@4 tells bgzip to use four threads; you can increase this if your machine has more cores. The third line directs the block-gzip compressed output to a file called file.vcf.bgz. The final command builds the tabix index, file.vcf.bgz.tbi. Note that bgzip's own --index option is not a substitute for this step: it writes a BGZF (.gzi) index, not a tabix index.

On my MacBook Pro this process took one minute to re-compress a 214MB file.

NB: This will not delete file.vcf.gz. You'll need to delete that yourself if you no longer want it.
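
Before converting anything, it may be worth checking whether your files are already block-gzip compressed, since bgzip itself names its output with a plain .gz extension by default. The sketch below is one rough way to do that by looking at the first 14 bytes of the file. The function name is_bgzf is my own invention, and it assumes the "BC" subfield comes first in the gzip extra field, which is how bgzip writes it as far as I know.

```shell
# is_bgzf FILE: succeed if FILE appears to start with a BGZF block header.
# Hypothetical helper; assumes the "BC" extra subfield is the first one.
is_bgzf() {
  # Dump the first 14 bytes as one continuous hex string (28 characters).
  header=$(head -c 14 "$1" | od -An -tx1 | tr -d ' \t\n')
  case "$header" in
    # 1f 8b = gzip magic, 08 = deflate, 04 = "extra field present",
    # then 8 bytes we ignore (mtime, flags, OS, extra length),
    # then 42 43 = "BC", the BGZF subfield identifier.
    1f8b0804????????????????4243) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: report which files in the current directory still need converting.
for f in *.vcf.gz; do
  [ -e "$f" ] || continue   # skip the literal pattern if nothing matches
  is_bgzf "$f" || echo "$f is plain gzip and needs recompressing"
done
```

Any file that passes this check can most likely be indexed with tabix directly, without recompressing it.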