
verify gunzip decompression


I am working with large datasets that I have concatenated using: cat file1.fasta.gz file2.fasta.gz > newfile.fasta.gz

Then I unzip newfile using: gunzip newfile.fasta.gz, to work with it in some bioinformatics software. The gunzip takes forever and I leave the computer and come back later.

I am worried that the process may have failed at some point, leaving a partial file. Is there any way to ascertain that newfile.fasta contains the complete decompressed content of newfile.fasta.gz?

inb4: "don't leave your computer"


Solution

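  • It should be fine: gunzip prints an error and exits with a non-zero status if decompression fails. If you're worried about the concatenation step, you could check the file size: newfile.fasta.gz should be exactly the size of file1.fasta.gz + file2.fasta.gz.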

    Since it looks like you've already unzipped the new file, you could double check the number of sequence entries in each fasta file.

    $ gunzip -c file1.fasta.gz | grep -c '^>'
    $ gunzip -c file2.fasta.gz | grep -c '^>'
    $ grep -c '^>' newfile.fasta
    

    The first two counts should add up to the third. Or you could just substitute wc for "grep -c '^>'" and compare the line, word, and byte totals the same way.
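
If you want this as a repeatable check, here is a rough sketch of the idea above as shell functions. The names count_records, verify_concat and same_content are made up for illustration, and same_content assumes the .gz is still around, e.g. because you decompressed with gunzip -c newfile.fasta.gz > newfile.fasta instead of plain gunzip.

```shell
#!/bin/sh
# Hypothetical helpers, not part of gzip; a sketch of the checks described above.

# count_records FILE.gz: print the number of FASTA header lines ('>') in a gzipped file.
count_records() {
    gzip -dc "$1" | grep -c '^>'
}

# verify_concat OUT.gz IN1.gz [IN2.gz ...]
# Succeeds only if every file passes gzip's integrity test (gzip -t) and the
# record count of OUT.gz equals the sum of the record counts of the inputs.
verify_concat() {
    out=$1
    shift
    gzip -t "$out" || return 1
    expected=0
    for f in "$@"; do
        gzip -t "$f" || return 1
        n=$(count_records "$f")
        expected=$((expected + n))
    done
    actual=$(count_records "$out")
    [ "$actual" -eq "$expected" ]
}

# same_content FILE.gz PLAIN
# Succeeds if PLAIN has the same checksum and byte count as the decompressed FILE.gz.
same_content() {
    [ "$(gzip -dc "$1" | cksum)" = "$(cksum < "$2")" ]
}
```

For example: verify_concat newfile.fasta.gz file1.fasta.gz file2.fasta.gz && echo OK. Note that same_content decompresses the archive a second time, so on large datasets it takes about as long as the original gunzip.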