Data distortion after using "scp" for transfer


Recently I transferred a set of data from one server to hpcc (high-performance computing). The commands looked like this:

scp /folder1/*.fastq.gz xxx@hpcc:/home/
scp /folder2/*.fastq.gz xxx@hpcc:/home/
scp /folder3/*.fastq.gz xxx@hpcc:/home/

I opened several terminals to transfer the data at the same time. In total I have ~50 such fastq.gz files, each around 10 GB. Is there any possibility that data (especially data this large) will be distorted when transferred in the way described above?

I ask because the data on the server is in good shape, while some of the data copied to hpcc is distorted.

thx thx


Solution

  • I strongly doubt that your data was corrupted in transit by scp(1).

    TCP provides a (weak) 16-bit checksum of traffic streams (a ones'-complement sum, not a true CRC). Because it is only sixteen bits long, relying on TCP alone for data integrity means a corrupted packet will still validate roughly once in every 2^16 (65,536) corrupted packets. I've long since lost the link (and the math), but I vaguely recall that this means corrupted data will be accepted as correct once every two to four gigabytes across the public Internet, though those numbers relied on a specific error introduction rate at the time I read that statistic.
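
    To make the weakness concrete, here is a minimal sketch of a 16-bit ones'-complement checksum of the kind TCP uses. The helper name `inet_sum` and the sample words are made up for illustration, and real TCP also covers a pseudo-header that is ignored here. It shows two kinds of damage the checksum cannot see: reordered words, and errors that cancel each other out.

    ```shell
    # inet_sum (hypothetical helper): 16-bit ones'-complement checksum
    # of the 16-bit words given as arguments.
    inet_sum() {
      sum=0
      for w in "$@"; do
        sum=$(( sum + $w ))
        # Fold any carry out of the low 16 bits back into the sum.
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
      done
      # The checksum is the bitwise complement of the folded sum.
      printf '0x%04X\n' $(( ~sum & 0xFFFF ))
    }

    original=$(inet_sum 0x1234 0xABCD 0x0001)
    reordered=$(inet_sum 0xABCD 0x1234 0x0001)    # first two words swapped
    offsetting=$(inet_sum 0x1235 0xABCC 0x0001)   # +1 in one word, -1 in another
    echo "$original $reordered $offsetting"
    ```

    All three inputs produce the same checksum, so a receiver relying on this check alone would accept either damaged version.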

    SSH version 2 introduced Message Authentication Codes (MACs) into the protocol. These are negotiated between peers, but I expect the weakest allowed would be based on MD5, which provides a 128-bit cryptographic hash of the data. Cryptographic hashes are far more advanced than the Cyclic Redundancy Checks that were more common for detecting data transmission errors two decades ago, and 128 bits is a significant expansion in checksum size. We might not trust MD5 enough these days to rely on it exclusively for resistance against dedicated attackers, but it should be sufficient for discovering accidental errors in all but the most incredible circumstances.
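
    If you want to see which MAC algorithms are on offer, a small sketch like the following may help. It assumes an OpenSSH client recent enough to support the `-Q` query option. Note that the list shows what the client supports, not what a particular connection negotiated.

    ```shell
    # List the MAC algorithms the local OpenSSH client supports.
    # Guarded so the snippet does nothing if ssh is not installed.
    macs=""
    if command -v ssh >/dev/null 2>&1; then
      macs=$(ssh -Q mac 2>/dev/null)
      printf '%s\n' "$macs"
    fi
    ```

    Running `ssh -v xxx@hpcc` prints debug output during key exchange, which should include the algorithms actually chosen for that session.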

    I would look elsewhere for your corruption -- first and foremost, the destination drives where you stored your data.
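
    Whatever the cause, you can confirm whether a copy is intact by comparing cryptographic digests on both ends. The sketch below assumes GNU coreutils `sha256sum` and `gzip` are available. The helper name `make_manifest`, the temporary directories standing in for the two hosts, and the sample file are all made up for illustration.

    ```shell
    # make_manifest (hypothetical helper): print "digest  filename" lines
    # for every *.fastq.gz file in the directory given as $1.
    make_manifest() {
      ( cd "$1" && sha256sum -- *.fastq.gz )
    }

    # Temporary directories stand in for the source server and hpcc.
    src=$(mktemp -d)
    dst=$(mktemp -d)
    printf '@read1\nACGT\n+\nIIII\n' | gzip > "$src/sample1.fastq.gz"
    cp "$src/sample1.fastq.gz" "$dst/"

    # Record digests at the source, then check the copies against them.
    make_manifest "$src" > "$src/manifest.sha256"
    ( cd "$dst" && sha256sum -c "$src/manifest.sha256" ) && result=ok || result=corrupt

    # gzip -t checks each archive against its built-in CRC-32.
    gzip -t "$dst"/*.fastq.gz && gz=ok || gz=bad

    echo "digest check: $result, gzip check: $gz"
    rm -rf "$src" "$dst"
    ```

    On the real systems the same idea would be: run `sha256sum *.fastq.gz > manifest.sha256` inside each source folder, copy the manifest along with the data, and run `sha256sum -c manifest.sha256` in the destination directory on hpcc. Any file reported as FAILED can simply be transferred again.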