
How can I determine differences between different encodings/serializations/etc?


There are all kinds of decoders for data formats such as Base64, the ASP EventValidation object, XML serialization, and so on. Is there a simple test I can run to tell which format I'm looking at?

For example, I have a string here that is part of a CGI-based web form. It's obviously hex (the full size is 5 KB): 52616e646f6d49567ef61b360522ae5ae69064f0ecb664a831c4196dad319215013aa8d04726b5d54ed673dad2004726c35e66d8b19c5177a331b24988f3cf11871084f6cc9ff808baf5cdee83f031a56dc42b65ee5309f1f1

I have no idea what that is. Converting the hex to ASCII gives me more nonsense like Ra_d__IVo6"Odd1_1/G&?sG&OfQw1I1_eS, and it's obviously not a Base64 string...

The question is basically: is there a method other than looking at different types, trying each one, and guessing?

Edit: I think this string is encrypted data, based on the prepended 52616e646f6d4956, but my question isn't what the string is; rather, it's how I can tell these things easily.


Solution

  • You can develop your own heuristic algorithm, similar to a virus scanner. It won't be 100% accurate, but it should improve over time. For example, you could take the string, note that it contains only characters from the hex alphabet, and flag it as possibly encrypted, zipped, or anything else commonly represented in hex.

    You could extend the heuristic to try N different encodings and perform word counts on each result. This could help narrow down the candidate encodings, but even in the simple case of the standard English alphabet there is plenty of overlap across encoding tables, so you will certainly get false positives. Still, as long as the text contains no characters outside that overlap, you should get readable content.

    As Marc pointed out, not all content is necessarily readable content. Pictures, zip files, and many other kinds of data will look like pure nonsense when rendered through an encoding table. But even items such as these often contain consistent data that the heuristic can detect.

    This topic can get pretty involved. Just look at TCP. One doesn't just fire packets across the internet expecting some magical interpretation of the data on the client side. There are predefined rules (protocols) that define how data is transmitted between client and server and what type of data it is. So, to directly answer your question regarding "guessing": you cannot be certain of the data you will receive, nor of your interpretation of it, but you certainly can develop an application that is smarter than a guess.
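A minimal sketch of such a heuristic, in Python, might look like the following. It is only an illustration of the idea above, not a complete detector: the function names, the small `MAGIC` signature table, and the 0.95 and 0.9 thresholds are all assumptions chosen for the example. It tests which alphabets the string fits, decodes each candidate, and then scores the decoded bytes by known file signatures, share of printable characters, and byte entropy.

```python
import base64
import binascii
import math
import re
from collections import Counter

# Illustrative (far from complete) table of well-known file signatures.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive",
    b"\x1f\x8b": "gzip stream",
    b"%PDF-": "PDF document",
}


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0.0 for constant data, 8.0 for uniform."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(c / total * math.log2(c / total) for c in Counter(data).values())


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII, tab, LF, or CR."""
    if not data:
        return 0.0
    return sum(32 <= b < 127 or b in (9, 10, 13) for b in data) / len(data)


def candidate_decodings(text: str) -> dict:
    """Decode the text with every encoding whose alphabet and length it fits."""
    text = text.strip()
    found = {}
    if text and len(text) % 2 == 0 and re.fullmatch(r"[0-9a-fA-F]+", text):
        found["hex"] = bytes.fromhex(text)
    if text and len(text) % 4 == 0 and re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", text):
        try:
            found["base64"] = base64.b64decode(text, validate=True)
        except binascii.Error:
            pass
    return found


def describe(data: bytes) -> str:
    """Give a rough label for decoded bytes; the thresholds are guesses."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    if printable_ratio(data) > 0.95:
        return "mostly printable text"
    # Short samples cannot reach 8 bits/byte, so normalize by the maximum
    # entropy that a sample of this length could possibly have.
    max_entropy = min(8.0, math.log2(len(data))) if len(data) > 1 else 0.0
    if max_entropy and shannon_entropy(data) / max_entropy > 0.9:
        return "high-entropy binary (possibly compressed or encrypted)"
    return "unrecognized binary"


if __name__ == "__main__":
    prefix = "52616e646f6d4956"  # the prepended bytes from the question
    for name, raw in candidate_decodings(prefix).items():
        print(name, raw, describe(raw))
```

Run on the prefix from the question, the hex candidate decodes to the ASCII text `RandomIV`, which is a readable hint in its own right. The same 16 characters also pass the Base64 alphabet and length checks, which is exactly the kind of overlap and false positive described above, so the scores should be treated as a ranking of possibilities rather than an answer.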