Removing potential malware from images in Python

I've read that .PNG, .JPEG, and other image file types can potentially contain malware.

I am looking to remove potential malware embedded in user-uploaded images. Is there a way in Python to "flatten" an image so that any malicious content is removed? Kind of like if you were to take a screenshot of the image and then save the screenshot? Or maybe there is an image type that can't be corrupted as easily?

I am already hosting all user uploaded content on a separate domain, but am wondering if I can take this a step further.

Solution

At its simplest level, a bitmap image contains two things:

meta-data, which is information about the image, and
pixel data, which are the pixel colours themselves.

The meta-data contains critical things like the image height and width, the number of channels, the bits per pixel, the image's colourspace and how it is compressed. It also contains arguably less-critical supplementary information, such as:

EXIF data - what camera was used, what lens, what exposure, GPS info and so on
ICC Colour Profiles for accurate colour reproduction
IPTC information - press and telecoms info, copyright, subject-matter tagging and so on
Geo-referencing and/or photogrammetry information - see GeoTIFF
comments - which can contain arbitrary information (and malware)
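
As a rough illustration of that split, here is a sketch that walks the chunks of a PNG using only the standard library. The helper names `make_chunk` and `list_chunks` and the tiny hand-built sample file are made up for the demo. In the PNG format, a chunk whose type starts with a lowercase letter is ancillary, meaning a decoder can ignore it and still display the image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes = b"") -> bytes:
    """Build one PNG chunk: length, type, data, CRC (demo helper)."""
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def list_chunks(png: bytes) -> list[tuple[str, int, bool]]:
    """Return (type, data length, is_ancillary) for each chunk up to IEND."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = []
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        length, chunk_type = struct.unpack(">I4s", png[pos:pos + 8])
        name = chunk_type.decode("latin-1")
        # Lowercase first letter marks an ancillary (non-essential) chunk
        chunks.append((name, length, name[0].islower()))
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
        if chunk_type == b"IEND":
            break
    return chunks


# Hypothetical 1x1 greyscale PNG carrying a text comment
comment = b"Comment\x00anything at all can go here"
idat = zlib.compress(b"\x00\x7f")  # filter byte + one grey pixel
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"tEXt", comment)
    + make_chunk(b"IDAT", idat)
    + make_chunk(b"IEND")
)

for name, size, ancillary in list_chunks(sample):
    print(name, size, "ancillary" if ancillary else "critical")
```

Only `tEXt` is reported as ancillary here. It is the kind of chunk that disappears when you rebuild the image from pixel data alone.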

The pixel data contains the colours (and potentially any transparency) of the grid of pixels making up the image. It is often compressed.

Note that the above is at a simple level. I only mentioned bitmap files, without referring to vector files such as SVG, which come with their own set of problems such as the "Billion Laughs" DoS attack; see https://en.wikipedia.org/wiki/Billion_laughs_attack

Note also that it is entirely possible to append an entire executable program to the end, or in the middle of an image, without necessarily upsetting image readers/display programs which, in general, ignore information they can't understand but try their best to use the parts they do. If you want an example, here I make a red image with ImageMagick and append 128kB of arbitrary data to the end and display it in the Terminal on a Mac without any complaint from macOS:

magick -size 1024x768 xc:red image.png             # make red image
dd if=/dev/zero bs=128 count=1024 >> image.png     # append 128kB of whatever I like - not actually malware in this case
open image.png                                     # use "xdg-open" on Linux
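
If you wanted to detect that kind of tacked-on data yourself, one possible approach for PNG is to walk the chunks until `IEND` and treat everything after it as extraneous. This is only a sketch: `split_at_iend` and `chunk` are hypothetical helper names, the demo file is built by hand, and the code does not verify CRCs or otherwise validate the image.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def split_at_iend(png: bytes) -> tuple[bytes, bytes]:
    """Split PNG bytes into (image up to and including IEND, whatever follows)."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 12 <= len(png):
        length, chunk_type = struct.unpack(">I4s", png[pos:pos + 8])
        pos += 12 + length  # length field + type + data + CRC
        if chunk_type == b"IEND":
            return png[:pos], png[pos:]
    raise ValueError("no IEND chunk found")


def chunk(chunk_type: bytes, data: bytes = b"") -> bytes:
    """Build one PNG chunk (used only to make a tiny demo file)."""
    crc = zlib.crc32(chunk_type + data)
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


# Hand-built 1x1 greyscale PNG followed by 128 kB of zeros, like the dd example
png = (
    PNG_SIGNATURE
    + chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + chunk(b"IDAT", zlib.compress(b"\x00\x7f"))
    + chunk(b"IEND")
)
image, extra = split_at_iend(png + bytes(128 * 1024))
print(len(image), len(extra))
```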

Note also that it is possible to embed other information using steganography, for example by hijacking the least significant bit of every pixel and using it to convey a message or carry some unexpected payload such as malware or a watermark. As it is the least significant bit, the change is normally visually imperceptible.
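
To make the least-significant-bit idea concrete, here is a toy sketch that works on a plain `bytes` object standing in for raw pixel data. The function names `embed_lsb`, `extract_lsb` and `clear_lsb` are invented for illustration, and masking the bit is shown only to demonstrate how fragile such a payload is, not as a recommended sanitising step.

```python
def embed_lsb(pixels: bytes, payload: bytes) -> bytes:
    """Hide payload in the least significant bit of successive pixel bytes."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("payload too large for this image")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit
    return bytes(out)


def extract_lsb(pixels: bytes, n_bytes: int) -> bytes:
    """Read n_bytes back out of the least significant bits."""
    result = bytearray()
    for i in range(n_bytes):
        value = 0
        for b in pixels[i * 8:(i + 1) * 8]:
            value = (value << 1) | (b & 1)
        result.append(value)
    return bytes(result)


def clear_lsb(pixels: bytes) -> bytes:
    """Zero the least significant bit of every pixel byte."""
    return bytes(b & 0xFE for b in pixels)


cover = bytes([200] * 64)   # stand-in for 64 bytes of raw pixel data
stego = embed_lsb(cover, b"hi")

print(extract_lsb(stego, 2))                          # the hidden bytes
print(max(abs(a - b) for a, b in zip(cover, stego)))  # largest per-byte change
print(extract_lsb(clear_lsb(stego), 2))               # what is left after masking
```

No pixel value moves by more than 1, which is why the change is so hard to see. With a NumPy array of `uint8` pixels the same mask would be `na & 0xFE`.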

So, now the question is what tradeoffs you wish to make, or put another way "how paranoid are you?" The more information you decide to strip from your image, the more likely you are to unintentionally lose some information you later need. If you strip the EXIF data, you will no longer know when the image was shot, or where, or by whom. If you strip the ICC Colour Profile, your image may appear washed out, or over-saturated, or green in some viewers. If you strip the IPTC information, you may be committing a licence-infringement if you are required by contract to retain it. If you strip the geo-referencing information, your data may become useless. If you strip the comments, you may lose masking information, or copyright, or tagging information. If you change the format from PNG/TIFF/GIF to JPEG, you will lose transparency and accuracy. If you change from TIFF to PNG, you will lose the ability to store 32-bit, 64-bit or floating point data and more than 4 channels. If you change from JPEG to PNG, you may inadvertently make the file many tens or hundreds of times larger.

So, pretty much the most paranoid action you could take would be to load the bitmap into memory, save it (preferably in memory rather than to disk, for performance reasons) in a format that is incapable of storing anything other than pixel data (e.g. PPM or raw RGB(A) bytes), and then re-save it as your JPEG or PNG. That will discard all EXIF/IPTC/Geo-data and comments, as well as any extraneous data tacked on at the end or in the middle of the image. If you want a concrete example, you could use the following ImageMagick command in Terminal:

magick input.jpg -strip ppm:- | magick ppm:- result.jpg

If you were using PIL/Pillow and Python, you could do:

from PIL import Image
import numpy as np

# Load image
im = Image.open('image.jpg')                                     

# Convert to format that cannot store IPTC/EXIF or comments, i.e. Numpy array
na = np.array(im)                                                                       

# Create new image from the Numpy array and save
Image.fromarray(na).save('clean.jpg')
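
For user uploads you will often have the bytes in memory rather than a file on disk. A minimal sketch of the same idea using `io.BytesIO` might look like the following. Note that it swaps the NumPy array for Pillow's own `tobytes()`/`frombytes()` pair, which plays the same "pixels only" role. The function name `reencode` and the appended marker are made up for the demo, and it assumes the decoded mode is one the target format can store (e.g. L, RGB or CMYK for JPEG).

```python
import io

from PIL import Image


def reencode(data: bytes, fmt: str = "JPEG") -> bytes:
    """Decode image bytes, keep only the pixels, and encode a fresh file in memory."""
    with Image.open(io.BytesIO(data)) as im:
        # tobytes() yields bare pixel values; only mode and size are carried across
        pixels_only = Image.frombytes(im.mode, im.size, im.tobytes())
    out = io.BytesIO()
    pixels_only.save(out, format=fmt)
    return out.getvalue()


# Demo: a small red JPEG with a made-up marker appended, standing in for an upload
buf = io.BytesIO()
Image.new("RGB", (16, 16), "red").save(buf, format="JPEG")
upload = buf.getvalue() + b"APPENDED-PAYLOAD"

clean = reencode(upload)
print(b"APPENDED-PAYLOAD" in upload, b"APPENDED-PAYLOAD" in clean)
```

Bear in mind that re-encoding a JPEG is lossy, so each pass through a function like this costs a little image quality.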

If your image is in PNG format though, you have added complications - it may be a palette image and it may have alpha/transparency information and you will likely want to retain that. That might look like this:

from PIL import Image
import numpy as np

# Load image
im = Image.open('image.png')                                     

# Convert to format that cannot store IPTC/EXIF or comments, i.e. Numpy array
na = np.array(im)                                                                       

# Create new image from the Numpy array
result = Image.fromarray(na)

# Copy forward the palette, if any
palette = im.getpalette()
if palette is not None:
    result.putpalette(palette)

# Save result
result.save('clean.png')

If you need to preserve some meta-data, you will need to consider other options.
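
As one example of such an option, you could rebuild the image from bare pixels and then carry forward just the one item you need. The sketch below keeps only the ICC colour profile, using the `icc_profile` entry Pillow exposes in `im.info` and accepts as a `save()` argument for JPEG and PNG. The function name `reencode_keep_icc` and the stand-in profile bytes are hypothetical. The profile is itself untrusted input that gets copied across unchanged, so this trades a little paranoia for colour fidelity.

```python
import io

from PIL import Image


def reencode_keep_icc(data: bytes, fmt: str = "JPEG") -> bytes:
    """Re-encode from bare pixels, carrying forward only the ICC profile."""
    with Image.open(io.BytesIO(data)) as im:
        icc = im.info.get("icc_profile")  # None when the file has no profile
        rebuilt = Image.frombytes(im.mode, im.size, im.tobytes())
    out = io.BytesIO()
    if icc:
        rebuilt.save(out, format=fmt, icc_profile=icc)
    else:
        rebuilt.save(out, format=fmt)
    return out.getvalue()


# Demo: a small JPEG tagged with stand-in bytes in place of a real ICC profile
buf = io.BytesIO()
Image.new("RGB", (16, 16), "red").save(
    buf, format="JPEG", icc_profile=b"stand-in profile bytes"
)

with Image.open(io.BytesIO(reencode_keep_icc(buf.getvalue()))) as result:
    kept = result.info.get("icc_profile")
print(kept)
```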