I'm coding an image uploader in PHP. It will allow the user to upload JPG and PNG images on a website. Next will be MP4 videos (as in the picture linked). Most importantly, my aim is to make this uploader as secure as possible.
( As a side note if you're interested, the uploader currently:
File content checking:
For instance, it's clear that inserting malicious PHP or JavaScript code into a .JPG or any other file is very easy. Because of this, I've also prepared my uploader to remove all tags like '<?php', '<style...' or '<script...' from the contents of each file.
That seems to fix one problem, but does it create another? For instance, this media file (please see the linked picture) contains characters like '<?ph'. This totally harmless, non-functional '<?ph' is obviously generated programmatically without ill will. So are several ? > tags that can be found in the same media file. I mentioned this just to lead you to my real question:
Does something prevent JPG, PNG and MP4 encoders or other related programs from generating full <?php, <style..., <script... and other tags into the files? We got close without trying, so I think it's fair to ask.
If nothing is preventing that, then I should find better methods to deal with malicious code in media files. And even if my remover worked, I'm still interested in the "right" ways of doing it.
I hope my question wasn't too broad as I mentioned multiple file types. Any help is highly appreciated. Many thanks.
Bonus question: What about PDF, WEBM, FLV and other common media files: can they natively contain such full tags?
No, no algorithm or codec will avoid such an output on purpose.
<?php and <style can also come in several encodings: ASCII, UTF-16, UTF-32... Each yields a different byte sequence, yet each could still be interpreted as text, just as PHP or HTML files can be in any encoding. With your approach you must also consider searching for, e.g., 0xff fe 3c 00 73 00 74 00 79 00 6c 00 65 00 to spot <style encoded in UTF-16LE (here preceded by a byte order mark). And now do the same for uppercase text.
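To illustrate how quickly the number of search patterns grows, here is a minimal Python sketch (the language is chosen only for illustration; the names NEEDLE and patterns are made up for this example) that prints the byte sequence of the same tag under a few encodings:

```python
import codecs

NEEDLE = "<style"

# Byte patterns for the same tag under different text encodings.
patterns = {
    "ascii": NEEDLE.encode("ascii"),
    "utf-16-le": NEEDLE.encode("utf-16-le"),
    "utf-16-be": NEEDLE.encode("utf-16-be"),
    "utf-32-le": NEEDLE.encode("utf-32-le"),
}

# With a leading byte order mark, the UTF-16LE form matches the bytes quoted above.
with_bom = codecs.BOM_UTF16_LE + patterns["utf-16-le"]

for name, pattern in patterns.items():
    print(f"{name:10} {pattern.hex(' ')}")
print(f"{'bom+16le':10} {with_bom.hex(' ')}")
```

Every pattern would have to be repeated for each upper/lowercase variant and for every other tag on the blacklist.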
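Yes, such an output can happen by coincidence: the bytes 0x3c 73 74 79 could be:
- the ASCII characters <, s, t and y
- the UTF-16LE characters 猼 and 祴
- the UTF-16BE characters 㱳 and 瑹
- the 32-bit little-endian integer 2037674812
- the 32-bit little-endian floating-point number 7.932e34 (approximately)
- the MS-DOS date 2040-11-20 and time 14:25:56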
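These readings can be reproduced with a short Python sketch. The variable names are illustrative, and reading the last pair as a 16-bit MS-DOS time word followed by a date word is an assumption about which date format is meant:

```python
import struct

raw = bytes([0x3C, 0x73, 0x74, 0x79])

ascii_text = raw.decode("ascii")        # four ASCII characters
utf16_le = raw.decode("utf-16-le")      # two UTF-16 code units, little-endian
utf16_be = raw.decode("utf-16-be")      # two UTF-16 code units, big-endian
(uint32_le,) = struct.unpack("<I", raw)  # 32-bit unsigned integer
(float32_le,) = struct.unpack("<f", raw)  # IEEE 754 single-precision float

# Assumed layout: MS-DOS time word first, then date word, both little-endian.
time_word, date_word = struct.unpack("<HH", raw)
dos_date = (1980 + (date_word >> 9), (date_word >> 5) & 0x0F, date_word & 0x1F)
dos_time = (time_word >> 11, (time_word >> 5) & 0x3F, (time_word & 0x1F) * 2)

print(ascii_text, utf16_le, utf16_be, uint32_le, float32_le, dos_date, dos_time)
```

The same four bytes are equally valid in every one of these readings; nothing in the bytes themselves says which one is intended.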
A group of 32-bit integers could form Latin letters in ASCII, or in UTF-16. It is up to the consumer not to over-interpret a file's content as whatever they want it to be. Valid PHP code only needs to begin with <? (when short open tags are enabled).
Files mostly have a format, which consists of the payload and additional storage, such as metadata. In a JFIF file the actual picture is the payload, while a potential thumbnail, a potential comment or potential Exif, IPTC, XMP or ICC blocks are metadata. The payload may contain bytes that resemble ASCII Latin letters. The file format itself can also contain Latin letters (as the identification for an APP marker or in JFIF's comment). The metadata can contain Latin letters as well, either because it is text or by coincidence.
In a PNG file, each chunk's 32-bit CRC field can by coincidence hold four bytes that read as Latin characters, <? among them. Chunks don't need to be dedicated to storing text (such as tEXt), but can also carry any data that a decoder silently ignores because it doesn't know how to deal with it. And the picture payload can contain such bytes by coincidence, too.
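The chunk layout can be sketched in Python as follows. The sample is a hand-built 1x1 greyscale PNG, and the chunk name prVt is a made-up private ancillary chunk used only to show that a decoder may skip data it does not understand:

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data, crc, crc_ok) for each chunk of an in-memory PNG."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        if pos + 8 > len(png):
            raise ValueError("truncated chunk header")
        length, chunk_type = struct.unpack(">I4s", png[pos:pos + 8])
        end = pos + 8 + length + 4
        if end > len(png):
            raise ValueError("truncated chunk")
        data = png[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", png[end - 4:end])
        crc_ok = crc == (zlib.crc32(chunk_type + data) & 0xFFFFFFFF)
        yield chunk_type, data, crc, crc_ok
        pos = end


# Hypothetical sample: 1x1 greyscale image plus a private chunk with text in it.
sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"prVt", b"<?php echo 1; ?>")
    + make_chunk(b"IDAT", zlib.compress(b"\x00\x00"))
    + make_chunk(b"IEND", b"")
)

for chunk_type, data, crc, crc_ok in iter_chunks(sample):
    print(chunk_type.decode("ascii"), len(data), f"{crc:08x}", crc_ok)
```

Each chunk's length and CRC are computed from its contents, so cutting bytes out of a chunk without rewriting both fields leaves a file that no longer validates.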
WebM and FLV are containers, so not only their formats, but also their streams have multiple chances for such byte combinations - you have to expect VP8, VP9, Vorbis and Opus for WebM and Sorensen Spark, VP6, Screen video, H.264, MP3 and even more for FLV. PDF can contain both binary and text and is rather a nightmare to parse.
You will neither discover all occurrences of what looks like text and what you consider dangerous, nor can any of these file formats be guaranteed never to contain something that can be interpreted as text. I'm interested in how you "remove" such findings without corrupting each file's format.
A better approach would be to recognize the file format: first look for a signature and, upon finding one, run further tests until you are sure enough of what you are holding. If that fails, you can reject the upload. What remains should never have a chance of being interpreted as a PHP file, which is easy to configure.
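The first step of that recognition, the signature check, might look like the following Python sketch. The function name sniff_media_type is made up for this example, and the ftyp test only shows that the file is some ISO base media format (MP4 and its relatives), so further format-specific tests would still be needed:

```python
from typing import Optional

JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_media_type(head: bytes) -> Optional[str]:
    """Guess a media type from the first bytes of an upload, or return None."""
    if head.startswith(JPEG_MAGIC):
        return "image/jpeg"
    if head.startswith(PNG_MAGIC):
        return "image/png"
    # ISO base media files start with a box: 4-byte size, then the type "ftyp".
    if len(head) >= 12 and head[4:8] == b"ftyp":
        return "video/mp4"
    return None


# Anything without a known signature is rejected outright.
for head in (b"\xff\xd8\xff\xe0\x00\x10JFIF\x00", b"<?php echo 1; ?>"):
    kind = sniff_media_type(head)
    print(kind if kind else "rejected")
```

A matching signature is only a starting point: a file can begin with valid magic bytes and still be malformed or carry arbitrary data further in.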