I am working with large datasets that I have concatenated using: cat file1.fasta.gz file2.fasta.gz > newfile.fasta.gz
Then I unzip newfile using: gunzip newfile.fasta.gz, to work with it in some bioinformatics software. The gunzip takes forever, so I leave the computer and come back later.
I am worried that the process may have failed at some point, leaving a partial file. Is there any way to ascertain that newfile.fasta contains the complete decompressed content of newfile.fasta.gz?
inb4: "don't leave your computer"
It should be fine. If you're worried about the concatenation step, you could just check the file size: newfile.fasta.gz should be exactly the size of file1.fasta.gz plus the size of file2.fasta.gz, in bytes.
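A minimal sketch of that size check, in case it helps. The function name sizes_add_up is made up for illustration; it only uses cat and wc, and it assumes newfile.fasta.gz has not been removed yet (gunzip deletes it by default once it finishes).

```shell
# Hypothetical helper: succeeds if the first file has exactly as many bytes
# as all the remaining files put together.
# Usage: sizes_add_up newfile.fasta.gz file1.fasta.gz file2.fasta.gz
sizes_add_up() {
    combined=$1
    shift
    # Byte count of the concatenated file.
    actual=$(wc -c < "$combined")
    # Total byte count of the parts, in the order given.
    expected=$(cat "$@" | wc -c)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    [ "$((actual))" -eq "$((expected))" ]
}
```

A matching size only tells you the cat step wrote everything; it says nothing about the later gunzip.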
Since it looks like you've already unzipped the new file, you could double-check the number of sequence entries in each fasta file; the counts for file1 and file2 should add up to the count for newfile.fasta:
$ gunzip -c file1.fasta.gz | grep -c '^>'
$ gunzip -c file2.fasta.gz | grep -c '^>'
$ grep -c '^>' newfile.fasta
Alternatively, you could substitute wc for "grep -c '^>'" to compare line, word, and byte counts instead of sequence counts.
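If you want something stricter than matching counts, one option is a byte-for-byte comparison against the original inputs. The sketch below assumes file1.fasta.gz and file2.fasta.gz are still around; verify_decompressed is a made-up name, and it relies only on gzip -t, gunzip -c and cmp.

```shell
# Hypothetical helper: succeeds only if the decompressed file is identical,
# byte for byte, to the decompressed contents of the given .gz inputs.
# Usage: verify_decompressed newfile.fasta file1.fasta.gz file2.fasta.gz
verify_decompressed() {
    plain=$1
    shift
    # gzip -t checks the integrity of each compressed input without writing output.
    gzip -t "$@" || return 1
    # gunzip -c streams the inputs to stdout in order; cmp reads that stream
    # as "-" and exits non-zero on any difference, including a short file.
    gunzip -c "$@" | cmp -s - "$plain"
}
```

Running verify_decompressed newfile.fasta file1.fasta.gz file2.fasta.gz && echo complete would print "complete" only when nothing is missing. It decompresses the inputs a second time, so expect it to take roughly as long as the original gunzip.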