Detect truncated JPEG images by comparing header dimensions to actual data length

A PowerShell script retrieves inbound mail messages sent from mobile phones and stores the JPEG attachments in a database. The messages are often sent from areas with poor cell service and arrive truncated, usually mid-attachment, yet the mail server still accepts them. As described in several postings on Stack Overflow and elsewhere, one way to check whether an attachment is complete is to look for the FF D9 bytes that mark the end of a JPEG file:

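# Full type names are used because PowerShell 4 has no "using namespace".
$binaryReader = New-Object System.IO.BinaryReader([System.IO.File]::OpenRead($filePath))
# Read the last two bytes of the file; [void] keeps return values out of the pipeline.
[void]$binaryReader.BaseStream.Seek(-2, [System.IO.SeekOrigin]::End)
[byte[]]$bytes = $binaryReader.ReadBytes(2)
$binaryReader.Close()
# FF D9 is the JPEG end-of-image marker.
$endsWithEoi = ($bytes.Length -eq 2) -and ($bytes[0] -eq 0xFF) -and ($bytes[1] -eq 0xD9)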

Unfortunately, for some mobile carriers, or possibly some combinations of carrier and phone OS, the JPEG images have extra bytes appended. These images are not truncated and can be loaded in ImageMagick and viewed with standard graphics viewers, but the test above fails for them. Many such attachments end with a variable-length blob of data whose last eight bytes are 0x57 0x40 0x40 0x43 0x72 0x65 0x65 0x66, but there are other variations.

It occurred to me that if the jpeg headers specify the height and width of the image perhaps there is a different approach for testing for truncation. Code could load the image and attempt to read the pixel at the bottom-right corner and see if there is an error.

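Add-Type -AssemblyName System.Drawing   # load the GDI+ types
$bitmap = [System.Drawing.Bitmap]::FromFile($filePath)
$pixelColor = $bitmap.GetPixel($bitmap.Width - 1, $bitmap.Height - 1)
$bitmap.Dispose()                       # release the lock on the file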

I grabbed a severely truncated JPEG file: one with a small file size that, when displayed in an image viewer, shows a rectangular strip from the top of the photo while the rest is blank. When I ran the above code against the file, the Bitmap object reported a width and height of 2560 x 1536, which are typical dimensions for a non-truncated file. I was hoping that the GetPixel call for the last pixel would return null or throw an exception, but it did not. It returned an RGB value just as if the file were not truncated.

I am running this code under PowerShell 4 and the .NET Framework 4 on Windows Server 2012. I thought that perhaps, when instantiating the Bitmap object, .NET had allocated a memory buffer large enough to hold the bitmap based on the dimensions from the JPEG header and then loaded as much data as was available. However, when I sampled various pixels near the bottom-right corner, the color object had data. Here is the color value at position x=2559, y=1535: R:114, G:113, B:111.

This does not look to be a default gray color used when no data is available because other adjacent pixels had different values. For what it's worth the RGB values for the small sample of pixels I looked at in the blank area tended to be in the range of 110 to 116. By contrast there was much more variance in the RGB values in the top-left corner.

Why doesn't this approach work? When fed a truncated file, why would the .NET Framework Bitmap object not throw an error? Are the phantom pixel color values coming from uninitialized memory? Is there anything else I should try in the way of coming up with a reliable test for truncation?

Solution

ImageMagick will detect truncated JPEG files. For example:

$ convert -regard-warnings truncated.jpg x.png
convert: Premature end of JPEG file `truncated.jpg' @ warning/jpeg.c/JPEGWarningHandler/352.
convert: Corrupt JPEG data: premature end of data segment `truncated.jpg' @ warning/jpeg.c/JPEGWarningHandler/352.
$ echo $?
1

The -regard-warnings flag makes convert return a non-zero exit code on a warning.
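
If the check needs to run from a script, the exit code is the only thing that has to be inspected. Below is a minimal sketch in Python (chosen here only for brevity; the same call can be made from PowerShell). The function name `imagemagick_accepts` and its `run` parameter are hypothetical, and the sketch assumes a `convert` executable from ImageMagick is on the PATH. It writes to `null:`, ImageMagick's discard-output pseudo-format, instead of `x.png`.

```python
import subprocess


def imagemagick_accepts(path, run=subprocess.run):
    """Return True if `convert -regard-warnings` exits with status 0.

    `run` is injectable so the function can be exercised without
    ImageMagick installed (hypothetical helper, not part of any library).
    """
    result = run(
        # Executable name is an assumption; on Windows a full path to
        # ImageMagick's convert may be needed.
        ["convert", "-regard-warnings", path, "null:"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # With -regard-warnings, a warning such as "Premature end of JPEG file"
    # produces a non-zero exit code.
    return result.returncode == 0
```

A caller would treat a `False` result as "truncated or otherwise damaged" and could log the captured stderr text for diagnosis.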

Alternatively, the IJG JPEG decoder will warn on truncated files. If you're prepared to write some C, you could run that over your images.

The process would be something like:

1. Point the decompressor at your file.
2. Repeatedly fetch scanlines until you have seen the whole image.
3. Check the num_warnings field in the error manager. If it is greater than zero, the file has problems.

The example.c in the distribution is very helpful. There's also libjpeg-turbo, which is ABI-compatible with the IJG decoder and a lot quicker, if speed is an issue.
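
If neither tool is available, a cheaper check that does not decode the image is to walk the JPEG marker structure and see whether an end-of-image marker (FF D9) is reached before the data runs out. The sketch below is in Python and is not taken from ImageMagick or the IJG code; the function name `jpeg_has_eoi` is hypothetical. It skips length-prefixed segments (so an embedded thumbnail's own FF D9 is not mistaken for the real one) and ignores anything after the first top-level EOI, which should tolerate the appended trailer bytes described in the question. It only detects a missing tail; it does not verify that the compressed data itself decodes cleanly.

```python
import struct

# Markers with no length field: TEM, RST0-RST7, SOI, EOI.
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))


def jpeg_has_eoi(data: bytes) -> bool:
    """Return True if the marker walk reaches EOI before the data ends."""
    n = len(data)
    if data[:2] != b"\xff\xd8":              # must start with SOI
        return False
    pos = 2
    while pos < n:
        if data[pos] != 0xFF:                # expected a marker here
            return False
        while pos < n and data[pos] == 0xFF: # skip 0xFF fill bytes
            pos += 1
        if pos >= n:
            return False
        marker = data[pos]
        pos += 1
        if marker == 0xD9:                   # EOI: trailing bytes are ignored
            return True
        if marker in STANDALONE:
            continue
        if pos + 2 > n:                      # length field itself is cut off
            return False
        (seg_len,) = struct.unpack(">H", data[pos:pos + 2])
        if seg_len < 2 or pos + seg_len > n: # segment runs past end of data
            return False
        pos += seg_len
        if marker == 0xDA:                   # SOS: entropy-coded data follows
            while True:
                i = data.find(b"\xff", pos)
                if i < 0 or i + 1 >= n:      # ran out of data before a marker
                    return False
                nxt = data[i + 1]
                if nxt == 0x00 or 0xD0 <= nxt <= 0xD7:
                    pos = i + 2              # stuffed byte or restart marker
                elif nxt == 0xFF:
                    pos = i + 1              # fill byte, keep scanning
                else:
                    pos = i                  # real marker, handled above
                    break
    return False
```

To use it on a file, read the bytes first, for example `jpeg_has_eoi(open(path, "rb").read())`. Files that pass this check could still be handed to ImageMagick or libjpeg for a full decode when a stronger guarantee is needed.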