I was reading SequenceFile.java in the Hadoop 1.0.4 source code and came across the sync(long) method, which is used to find a "sync marker" (a 16-byte MD5 digest generated at file creation time) in a SequenceFile when MapReduce splits the file into input splits.
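For context, the marker itself is generated once per file when the Writer is created, by MD5-hashing a fresh UID together with the creation time. Here is a minimal sketch of that idea (my paraphrase from memory of the Writer code, not the exact Hadoop source; the class and method names are mine):

import java.rmi.server.UID;
import java.security.MessageDigest;

public class SyncMarkerSketch {
    // MD5 of a fresh UID plus the current time: 16 essentially random bytes.
    static byte[] makeSyncMarker() throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        long time = System.currentTimeMillis();
        md5.update((new UID() + "@" + time).getBytes());
        return md5.digest();                 // MD5 always yields 16 bytes
    }

    public static void main(String[] args) throws Exception {
        System.out.println(makeSyncMarker().length);  // prints 16
    }
}

Here is the sync(long) method itself: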
/** Seek to the next sync mark past a given position. */
public synchronized void sync(long position) throws IOException {
  if (position + SYNC_SIZE >= end) {
    seek(end);
    return;
  }

  try {
    seek(position + 4);                        // skip escape
    in.readFully(syncCheck);                   // prime the 16-byte window
    int syncLen = sync.length;
    for (int i = 0; in.getPos() < end; i++) {
      int j = 0;
      for (; j < syncLen; j++) {
        // compare the window in logical order; i % syncLen is its start
        if (sync[j] != syncCheck[(i + j) % syncLen])
          break;
      }
      if (j == syncLen) {
        in.seek(in.getPos() - SYNC_SIZE);      // position before sync
        return;
      }
      syncCheck[i % syncLen] = in.readByte();  // overwrite the oldest byte
    }
  } catch (ChecksumException e) {              // checksum failure
    handleChecksumException(e);
  }
}
This code simply scans forward one byte at a time, looking for a run of bytes equal to the sync marker; syncCheck acts as a 16-byte circular buffer whose logical start is i % syncLen, so the window is never shifted in memory.
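To convince myself of how this works, I rewrote the scan as a standalone method over a byte array. This is a toy re-implementation of the same sliding-window idea (not Hadoop code); the real method additionally rewinds by SYNC_SIZE once it finds a match:

import java.util.Arrays;

public class SyncScanDemo {
    /**
     * Returns the index just past the first occurrence of sync in data
     * at or after 'from', or -1 if there is none. As in sync(long),
     * syncCheck holds the current window and i % syncLen marks its
     * logical start, so no bytes are ever shifted in memory.
     */
    static int findSyncEnd(byte[] data, byte[] sync, int from) {
        int syncLen = sync.length;
        if (from + syncLen > data.length) {
            return -1;                            // not enough bytes left
        }
        // Prime the window with the first syncLen bytes, like readFully().
        byte[] syncCheck = Arrays.copyOfRange(data, from, from + syncLen);
        int pos = from + syncLen;                 // next byte to slide in
        for (int i = 0; ; i++) {
            int j = 0;
            for (; j < syncLen; j++) {
                // compare in logical order, oldest byte first
                if (sync[j] != syncCheck[(i + j) % syncLen]) {
                    break;
                }
            }
            if (j == syncLen) {
                return pos;                       // the window matched
            }
            if (pos == data.length) {
                return -1;                        // ran out of bytes
            }
            syncCheck[i % syncLen] = data[pos++]; // overwrite the oldest byte
        }
    }

    public static void main(String[] args) {
        byte[] sync = "0123456789abcdef".getBytes(); // stand-in 16-byte marker
        byte[] data = "xxxx0123456789abcdefyyyy".getBytes();
        System.out.println(findSyncEnd(data, sync, 0)); // prints 20
    }
}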
My question: suppose the data stored in a SequenceFile happens to contain a 16-byte sequence identical to the sync marker. Won't the code above mistakenly treat that data as a sync marker, so the SequenceFile is no longer parsed correctly?
I can't find any "escape" operation applied to the data or to the sync marker. How can SequenceFile be binary safe? Am I missing something?
Collisions are technically possible, but in practice they are astronomically unlikely.
From http://search-hadoop.com/m/VYVra2krg5t1:
"The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data is about 10^-23. It's more likely that your data center is wiped out by a meteorite." (See http://preshing.com/20110504/hash-collision-probabilities for the math behind such estimates.)
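To sanity-check that figure: a fixed 128-bit marker matches one window of uniformly random bytes with probability 2^-128, and a petabyte contains roughly 10^15 such windows, so a union bound gives about 10^15 / 2^128 expected accidental matches. A quick verification (my own arithmetic, not from the linked thread):

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

public class SyncCollisionOdds {
    public static void main(String[] args) {
        // Probability that one 16-byte window of uniformly random data
        // equals a fixed 128-bit marker: 2^-128.
        BigDecimal perWindow = BigDecimal.ONE.divide(
                new BigDecimal(BigInteger.ONE.shiftLeft(128)),
                MathContext.DECIMAL64);

        // A petabyte holds roughly 10^15 overlapping 16-byte windows.
        BigDecimal windows = new BigDecimal("1e15");

        // Union bound on the chance of at least one accidental match.
        System.out.println(windows.multiply(perWindow, MathContext.DECIMAL64));
        // prints ~2.9E-24, the same ballpark as the 10^-23 quoted above
    }
}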