Reliably identifying a JPG?

For the purpose of identifying and comparing JPG images taken from cameras, I want to calculate an MD5 hash of the scan portion of the image inside the JPG. My idea is to take the bytes between the SOS and the EOI marker and hash those bytes, on the assumption that they will never change unless the actual image is processed and altered.

Apparently this question has already come up several times (1, 2, 3). Rather complicated solutions have been suggested, which puzzles me given my rather simple but apparently effective approach. (Or is it too simple to be true?)

I know there can be multiple pairs of SOS ($FFDA) and EOI ($FFD9) in a JPG file; my present files contain three: a thumbnail, the actual image, and an additional 1920x1080 image (Sony). My present approach is to parse the stream, locate the next SOS, look for the following EOI, calculate the size of that span, and assume it is the actual image if the size exceeds 50% of the file size.
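To make the heuristic concrete, here is a minimal sketch of it in Python (my own code is Delphi; Python is used here only for brevity). The function name `find_main_scan` and the `min_fraction` parameter are made up for illustration. It does a raw byte search, so a $FFDA or $FFD9 byte pair that happens to occur inside metadata would fool it.

```python
from typing import Optional, Tuple

SOS = b"\xff\xda"  # Start Of Scan marker
EOI = b"\xff\xd9"  # End Of Image marker


def find_main_scan(data: bytes, min_fraction: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first SOS..EOI span that is larger than
    min_fraction of the whole file, or None if there is no such span.

    start is the offset just past the SOS marker, end is the offset of EOI.
    This is a raw byte search, not a real JPEG parser.
    """
    pos = 0
    while True:
        sos = data.find(SOS, pos)
        if sos < 0:
            return None
        start = sos + 2
        eoi = data.find(EOI, start)
        if eoi < 0:
            return None
        if eoi - start > min_fraction * len(data):
            return start, eoi
        # Too small (probably a thumbnail): continue after this EOI
        pos = eoi + 2
```

With a small embedded image first and the large one second, the small span is rejected and the offsets of the large one are returned.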

This approach works with my present files. I stripped all metadata from a JPG file with exiftool -all= image.jpg and found the MD5 hash to be identical. Yet the algorithm seems rather coarse to me. So here are my questions:

Is there any risk that simply examining the space between SOS and EOI can fail? I have read this, but am still not sure.

Parsing every byte from the SOS of the actual image takes a lot of time. I take it from here that there is no shortcut to finding the end of the compressed data. But I might just leap forward 80% or so from the second SOS marker. I am talking about images from a camera - how much can I rely on the fact that there will be a thumbnail coming first and the actual image after it?

Should I start 6 Bytes after SOS (here?)

Any ideas for a better approach?

Solution

After doing some research and running a bunch of tests, I present my solution to my question here.

First, I want to make clear that we are not talking about a forensic investigation. There are possibly ways to manipulate a JPG image so that markers appear where they shouldn't and are missing where the specs require them.

We are not talking about image identity or similarity, either. If you losslessly rotate a JPG you still have the very same image information, but not the identical image any more. We're not talking, either, about images that have been resized, optimized or altered in any other way.

What we are talking about is identifying simple duplicates or JPGs that have been renamed or where metadata has been modified or removed, but where the image itself has never been processed or tampered with in any way.

Is a hash of the bytes between the SOS and the EOI markers a reliable way to uniquely identify an image?

Yes, it is. Within bounds of reason there is no way two files with identical MD5 checksums of the image scan data can contain non-identical images and vice versa.
I examined sample photos taken with cameras from 12 different makers and edited or stripped the metadata. This wasn't strictly necessary: from the specs and the code you can see that all metadata resides in separate segments (that's why you can hide all kinds of stuff in a JPG), so metadata operations never touch the scan data. Still, the test confirmed it: identical MD5 checksums all over the place.

Is there any way to quickly locate the (right) SOS marker?

Definitely. The JPG specs are a mess and a punishment. After trying quite a few pieces of code I found NativeJPG by Nils Haeck to be the most straightforward. This has been adapted from sdJpegImage:

// Returns the stream position immediately after the first top-level SOS
// marker ($FF $DA), i.e. at the length field of the SOS header, or 0 if no
// SOS marker is found. The EXIF thumbnail is skipped automatically because
// it lives inside an APPn segment that is skipped as a whole.
function FindSOSPos(S: TStream): Cardinal;
var
  B, MarkerTag: Byte;
  Size, W: Word;
const
  mkNone = 0; mkSOF0 = $c0; mkSOF1 = $c1; mkSOF2 = $c2; mkSOF3 = $c3; mkSOF5 = $c5;
  mkSOF6 = $c6; mkSOF7 = $c7; mkSOF9 = $c9; mkSOF10 = $ca; mkSOF11 = $cb; mkSOF13 = $cd;
  mkSOF14 = $ce; mkSOF15 = $cf; mkDHT = $c4; mkDAC = $cc; mkSOI = $d8; mkEOI = $d9; mkSOS = $da;
  mkDQT = $db; mkDNL = $dc; mkDRI = $dd; mkDHP = $de; mkEXP = $df; mkAPP0 = $e0; mkAPP15 = $ef; mkCOM = $fe;
begin
  Result := 0;
  repeat
    // Read the next byte; give up if the stream is exhausted
    if S.Read(B, 1) = 0 then
      Exit;
    // Do we have a marker?
    if B = $FF then
    begin
      // A marker may be preceded by any number of $FF fill bytes
      repeat
        if S.Read(MarkerTag, 1) = 0 then
          Exit;
      until MarkerTag <> $FF;
      if MarkerTag = mkSOS then
        Break;
      Size := 0;
      if MarkerTag in [mkAPP0..mkAPP15, mkDHT, mkDQT, mkDRI,
        mkSOF0, mkSOF1, mkSOF2, mkSOF3, mkSOF5, mkSOF6, mkSOF7, mkSOF9, mkSOF10, mkSOF11, mkSOF13, mkSOF14, mkSOF15,
        mkCOM, mkDNL] then
      begin
        // Big-endian segment length; it includes its own two bytes
        if S.Read(W, 2) <> 2 then
          Exit;
        W := Swap(W);
        if W < 2 then
          Exit; // malformed length field
        Size := W - 2;
      end;
      // Skip the payload (Size stays 0 for markers without a length, e.g. SOI)
      S.Position := S.Position + Size;
    end else
    begin
      // B <> $FF is an error; be flexible and resynchronise on the next $FF
      repeat
        if S.Read(B, 1) = 0 then
          Exit;
      until B = $FF;
      S.Seek(-1, soFromCurrent);
    end;
  until False;
  Result := S.Position;
end;

Omit the first 6 Bytes after the SOS marker?

I decided to hash everything between SOS and EOI excluding the markers themselves.
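For readers who don't use Delphi, the whole idea (walk the segments to the SOS marker, then hash up to EOI) can be sketched in Python. This is an illustration, not a port of NativeJPG: the names `find_sos` and `scan_md5` are made up, every marker not listed as standalone is assumed to carry a length field, and EOI is located with a plain search for the first $FFD9, which assumes an ordinary single-scan camera JPG.

```python
import hashlib
import struct

# Markers without a length field: stuffed zero, TEM, RST0..RST7, SOI, EOI
STANDALONE = {0x00, 0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def find_sos(data: bytes) -> int:
    """Return the offset just past the first top-level SOS marker, or -1."""
    pos, n = 0, len(data)
    while pos + 1 < n:
        if data[pos] != 0xFF:
            pos += 1            # lenient: resynchronise on the next 0xFF
            continue
        marker = data[pos + 1]
        if marker == 0xFF:      # fill byte before the real marker
            pos += 1
            continue
        pos += 2
        if marker == 0xDA:      # SOS
            return pos
        if marker in STANDALONE:
            continue
        if pos + 2 > n:
            return -1
        (length,) = struct.unpack(">H", data[pos:pos + 2])
        if length < 2:
            return -1           # malformed length field
        pos += length           # skip the segment, e.g. APP1 with the thumbnail
    return -1


def scan_md5(data: bytes) -> str:
    """MD5 of everything between SOS and EOI, excluding both markers."""
    start = find_sos(data)
    if start < 0:
        raise ValueError("no SOS marker found")
    end = data.find(b"\xff\xd9", start)
    if end < 0:
        raise ValueError("no EOI marker found")
    return hashlib.md5(data[start:end]).hexdigest()
```

Two files that differ only in their APPn segments give the same digest, because those segments are skipped by their length field before hashing starts.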

Is there a fast way to locate the trailing EOI marker?

No. But this is irrelevant, since for performing a hash you have to read every single byte anyway.
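The search stays linear, but it only has to stop at $FF bytes. Inside the compressed data, $FF followed by $00 is a stuffed data byte and $FFD0..$FFD7 are restart markers; any other marker ends the current scan. The following Python sketch (the function name `find_eoi` is made up) uses that rule and skips segments that sit between scans by their length field, which I assume is what a multi-scan (progressive) file needs.

```python
def find_eoi(data: bytes, pos: int) -> int:
    """pos is the offset just past an SOS marker. Return the offset of the
    EOI marker that ends this image, or -1 if the data is truncated."""
    n = len(data)
    if pos + 2 > n:
        return -1
    # Skip the SOS header; its big-endian length includes the two length bytes
    pos += int.from_bytes(data[pos:pos + 2], "big")
    while True:
        i = data.find(b"\xff", pos)
        if i < 0 or i + 1 >= n:
            return -1
        marker = data[i + 1]
        if marker == 0x00 or 0xD0 <= marker <= 0xD7:
            pos = i + 2          # stuffed byte or restart marker: still scan data
        elif marker == 0xFF:
            pos = i + 1          # fill byte, look at the next one
        elif marker == 0xD9:
            return i             # EOI
        else:
            # Another segment between scans (e.g. DHT or a further SOS):
            # skip it by its length, then continue in the next scan
            if i + 4 > n:
                return -1
            pos = i + 2 + int.from_bytes(data[i + 2:i + 4], "big")
```

The difference to a plain search for $FFD9 shows up when those two bytes occur inside a segment between scans: the plain search stops there, this one skips the segment.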

How reliable is this approach?

As I said, I believe that, within the bounds of reason, this approach is practically 100% free of false positives. As for locating the right image: NativeJPG has been around for more than 10 years and draws very few complaints; those that exist concern decoding the image, not failing to find it.

In my application I offer the option to store the original filename, the EXIF DateTimeDigitized, the camera make, the GPS coordinates and MD5 hashes of the scan data (full and first 16 kB) in the UserComment field. I'm pretty confident that this will make it possible to identify the file later under most conditions (provided the UserComment has remained intact).