validation machine-learning file-format feature-extraction fuzzing

Tools for Feature Extraction from Binary Data of Images

I am working on a project where I am have image files that have been malformed (fuzzed i.e their image data have been altered). These files when rendered on various platforms lead to warning/crash/pass report from the platform.

I am trying to build a shield using unsupervised machine learning that will help me identify/classify these images as malicious or not. I have the binary data of these files, but I have no clue of what featureSet/patterns I can identify from this, because visually these images could be anything. (I need to be able to find feature set from the binary data)

I need some advise on the tools/methods I could use for automatic feature extraction from this binary data; feature sets which I can use with unsupervised learning algorithms such as Kohenen's SOM etc.

I am new to this, any help would be great!

Solution

I do not think this is feasible.

The problem is that these are old exploits, and training on them will not tell you much about future exploits. Because this is an extremely unbalanced problem: no exploit uses the same thing as another. So even if you generate multiple files of the same type, you will in the end have likely a relevant single training case for example for each exploit.

Nevertheless, what you need to do is to extract features from the file meta data. This is where the exploits are, not in the actual image. As such, parsing the files is already much the area where the problem is, and your detection tool may become vulnerable to exactly such an exploit.

As the data may be compressed, a naive binary feature thing will not work, either.