python java serialization deserialization

How to convert Java serialization data into JSON?


A vendor-provided application we're maintaining stores some of its configuration in the form of "Java serialization data, version 5". A closer examination shows that the actual content is a java.util.ArrayList with several dozen elements of the same vendor-specific type (vendor.apps.datalayer.client.navs.shared.api.Function).

As we seek to deploy and configure instances of this application with Ansible, we'd like all configuration-files to be human-readable -- and subject to textual revision-control.

To that end, we need to be able to decode the Java serialization binary data into a human-readable form of some kind -- preferably, JSON. That JSON also needs to be convertible back into the same Java serialization format for the application to read it.

The accepted answer to an earlier question on this topic is Java-based:

  1. Read the Java serialization data using ObjectInputStream, casting it to the known type -- thus instantiating each object.
  2. Write it back out using GSON.

Though usable, that approach is less than ideal for us because:

  • it requires full knowledge of the vendor's type serialized in the data, even though we don't need to instantiate the objects;
  • we'd rather it be a Python script that we could integrate into Ansible.

There is a Python module for this, but custom classes seem to require providing custom Python code -- a lot of custom code -- even when all the fields of the class are themselves of standard Java-types.

It is my understanding that the serialized data itself already provides all the necessary information -- one does not need access to the class definition(s) unless one wants to invoke the methods of the class, which we don't...


Solution

  • The documentation on the format is available here.

    To explain the example more thoroughly:

    00: ac ed 00 05 73 72 00 04 4c 69 73 74 69 c8 8a 15 >....sr..Listi...<
    10: 40 16 ae 68 02 00 02 49 00 05 76 61 6c 75 65 4c >Z......I..valueL<
    20: 00 04 6e 65 78 74 74 00 06 4c 4c 69 73 74 3b 78 >..nextt..LList;x<
    30: 70 00 00 00 11 73 71 00 7e 00 00 00 00 00 13 70 >p....sq.~......p<
    40: 71 00 7e 00 03                                  >q.~..<
    

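    4b: 0xAC 0xED 0x00 0x05 – Magic header indicating this is java serialized data: the 2-byte magic number 0xACED, followed by the 2-byte stream version, 5 (the "version 5" from your question).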
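
    Checking those first 4 bytes from Python could look roughly like this. It's a minimal sketch; read_header is a made-up helper name, not something from an existing library.

```python
import struct

STREAM_MAGIC = 0xACED   # first 2 bytes of the stream
STREAM_VERSION = 5      # next 2 bytes


def read_header(data: bytes) -> int:
    """Validate the 4-byte header; return the offset of the first content byte."""
    if len(data) < 4:
        raise ValueError("too short to be Java serialization data")
    # Two big-endian unsigned 16-bit values: magic, then version.
    magic, version = struct.unpack_from(">HH", data, 0)
    if magic != STREAM_MAGIC:
        raise ValueError(f"bad magic 0x{magic:04X}")
    if version != STREAM_VERSION:
        raise ValueError(f"unsupported stream version {version}")
    return 4


# First six bytes of the example dump.
offset = read_header(bytes.fromhex("ac ed 00 05 73 72"))
```

    Everything after that offset is the recursive content described next.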

    1b: 0x73 - First item that was stored is a java object (the format is recursive; 'top level' items are always going to be an object, at least in your situation).

    1b: 0x72 - It's a 'normal' object. Not a proxy, for example. Specifically, it's an object of a type we haven't seen before, so the immediately following bytes will first describe the class that this object is an instance of, before we get around to giving you the actual contents of this object.

    ?b: 0x00 0x04 | 0x4C 0x69 0x73 0x74 - The name of the class that this object is an instance of is List (0x00 0x04 is the size of the UTF-8 data). Normally it'd be com.foo.pkgname.ClassName, but this example has an unfortunate name (java.util.List exists, but this isn't j.u.List, just some random example class also named List) and is in the unnamed package.

    8b: 0x69 0xc8 0x8a 0x15 0x40 0x16 0xae 0x68 – The serialVersionUID of this class. Irrelevant to you; just skip past 8 bytes here.

    1b: 0x02 - the flags for it; this one has the flag 'serializable'. Irrelevant to you.

    2b: 0x00 0x02 – this class contains 2 fields; they shall now be described.
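
    To make the layout so far concrete, here is a small sketch that reads the class name, serialVersionUID, flags and field count from the bytes right after 0x73 0x72. The helper names (read_utf, read_class_header) are my own, and I'm assuming plain UTF-8 decoding is good enough; Java actually writes 'modified UTF-8', which is identical for ASCII names like these.

```python
import struct


def read_utf(data: bytes, pos: int):
    """Read a 2-byte big-endian length followed by that many bytes of text."""
    (length,) = struct.unpack_from(">H", data, pos)
    start = pos + 2
    return data[start:start + length].decode("utf-8"), start + length


def read_class_header(data: bytes, pos: int):
    """Read name, serialVersionUID (8b), flags (1b), field count (2b)."""
    name, pos = read_utf(data, pos)
    # '>' means big-endian with no padding, so this is exactly 8 + 1 + 2 bytes.
    suid, flags, field_count = struct.unpack_from(">qBH", data, pos)
    header = {"name": name, "serialVersionUID": suid,
              "flags": flags, "field_count": field_count}
    return header, pos + 11


# Bytes 0x06..0x17 of the example dump (everything after 0x73 0x72).
SAMPLE = bytes.fromhex("00 04 4c 69 73 74 69 c8 8a 15 40 16 ae 68 02 00 02 49")
header, next_pos = read_class_header(SAMPLE, 0)
```

    After this, next_pos points at the 0x49 that starts the first field description.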

    Described field 1

    1b: 0x49 (the letter I) it is a primitive field, of type integer.

    ?b: 0x00 0x05 | 0x76 0x61 0x6c 0x75 0x65 – The name of this field is 'value'.

    Described field 2

    1b: 0x4c (the letter L) it is an object field.

    ?b: 0x00 0x04 | 0x6e 0x65 0x78 0x74 – The name of this field is 'next'.

    1b: 0x74 - you now get the name of the type of the field, in the form of actual data; it's always a string. So now we get to see how strings are encoded: It starts with the constant TC_STRING, which is 0x74.

    ?b: 0x00 0x06 | 0x4c 0x4c 0x69 0x73 0x74 0x3b – The string LList;. This is a java JVM-styled typename. It usually looks like e.g. Ljava/lang/Number; for the type java.lang.Number (starts with L, packages divided by slashes, ends in a semicolon); here it's just List again because unnamed package.
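
    The two field descriptions can be read with something like the following sketch. read_field and jvm_to_java are hypothetical helper names, and the sketch only handles a type name given as a fresh TC_STRING (0x74), not one given as a back-reference.

```python
import struct

PRIMITIVES = {"B": "byte", "C": "char", "D": "double", "F": "float",
              "I": "int", "J": "long", "S": "short", "Z": "boolean"}
TC_STRING = 0x74


def read_utf(data: bytes, pos: int):
    """Read a 2-byte big-endian length followed by that many bytes of text."""
    (length,) = struct.unpack_from(">H", data, pos)
    start = pos + 2
    return data[start:start + length].decode("utf-8"), start + length


def jvm_to_java(signature: str) -> str:
    """Turn 'Ljava/lang/Number;' into 'java.lang.Number'; leave others alone."""
    if signature.startswith("L") and signature.endswith(";"):
        return signature[1:-1].replace("/", ".")
    return signature


def read_field(data: bytes, pos: int):
    """Read one field description: type code, name, and (for objects) type."""
    typecode = chr(data[pos])
    name, pos = read_utf(data, pos + 1)
    if typecode in PRIMITIVES:
        return {"name": name, "type": PRIMITIVES[typecode]}, pos
    if data[pos] != TC_STRING:
        raise NotImplementedError("type name given as a reference")
    signature, pos = read_utf(data, pos + 1)
    return {"name": name, "type": jvm_to_java(signature)}, pos


# Bytes 0x17..0x2f of the example dump: both fields plus the trailing 0x78.
SAMPLE = bytes.fromhex("49 00 05 76 61 6c 75 65 4c 00 04 6e 65 78 74"
                       "74 00 06 4c 4c 69 73 74 3b 78")
field1, pos = read_field(SAMPLE, 0)
field2, pos = read_field(SAMPLE, pos)
```

    That leaves pos on the 0x78 discussed next.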

    1b: 0x78 - the constant TC_ENDBLOCKDATA - there are no annotations for this class. In your case I bet there never is.

    1b: 0x70 - the constant TC_NULL - the superclass is now described, but this class's superclass is j.l.Object and as a special shortcut that is never shown (j.l.Object doesn't itself have a superclass and has no fields so there'd be no point).

    We have now described the class. Note that almost everything also gets registered for future reference, so that the serialized data doesn't have to keep repeating this stuff over and over. In the spec's grammar this is the zero-length item newHandle: you are supposed to keep a big array of sorts, and every time you see newHandle you store the thing you just read / are about to read in it and increment the counter. So such a description is only provided once; next time you just get the handle.
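
    That 'big array of sorts' can be as simple as the sketch below. HandleTable is a name I made up; the only thing taken from the format is that handles count up from 0x007E0000.

```python
BASE_WIRE_HANDLE = 0x7E0000  # the first handle handed out in every stream


class HandleTable:
    """Remembers everything that was newHandle-d, in stream order."""

    def __init__(self):
        self._items = []

    def new_handle(self, item) -> int:
        """Register an item and return the handle the stream will use for it."""
        self._items.append(item)
        return BASE_WIRE_HANDLE + len(self._items) - 1

    def resolve(self, handle: int):
        """Look up the item a back-reference points at."""
        index = handle - BASE_WIRE_HANDLE
        if not 0 <= index < len(self._items):
            raise KeyError(f"unknown handle 0x{handle:08X}")
        return self._items[index]


# The four things that get a handle in the example stream, in order.
table = HandleTable()
h_class = table.new_handle("class description of List")
h_string = table.new_handle("LList;")
h_list1 = table.new_handle("list1")
h_list2 = table.new_handle("list2")
```

    A real parser would store the parsed class description and objects rather than these placeholder strings.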

    The actual data now follows. Each value is just piled on one after the other; you need to use the described fields to track along so you know what you are looking at.

    Value of field 1

    4b: 0x00 0x00 0x00 0x11 - the first field was of type I (integer), as you may recall. ints in java are 4 bytes long, and here it is; the decimal value 17 in hexadecimal is 11; this is because the list was made with list1.value = 17;.
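
    Reading a primitive value is then a lookup from the field's type code to a size and encoding. A sketch using Python's struct module (read_primitive is a made-up name; all values are big-endian):

```python
import struct

# Java type code -> struct format (big-endian).
PRIMITIVE_FORMATS = {
    "B": ">b",  # byte, 1 byte
    "C": ">H",  # char, 2 bytes (one UTF-16 code unit)
    "D": ">d",  # double, 8 bytes
    "F": ">f",  # float, 4 bytes
    "I": ">i",  # int, 4 bytes
    "J": ">q",  # long, 8 bytes
    "S": ">h",  # short, 2 bytes
    "Z": ">?",  # boolean, 1 byte
}


def read_primitive(data: bytes, pos: int, typecode: str):
    """Return (value, new position) for one primitive field value."""
    fmt = PRIMITIVE_FORMATS[typecode]
    (value,) = struct.unpack_from(fmt, data, pos)
    if typecode == "C":
        value = chr(value)
    return value, pos + struct.calcsize(fmt)


value, pos = read_primitive(bytes.fromhex("00 00 00 11"), 0, "I")
```

    For the four bytes above, value comes out as the 17 that was assigned to list1.value.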

    Value of field 2

    1b: 0x73 – the constant TC_OBJECT. Remember up top I said it's a recursive format? The whole thing we are looking at was an object being stored. One of the fields in this object is referring to another object, so we now get to that, and it starts with 0x73 for the same reason. This object is also of type List (it's the list1.next = list2; field). As with every object, the stream first has to say which class this object is an instance of, and this is where that handle stuff comes in.

    1b: 0x71 – Last time we saw 0x72 here. This time we get 0x71 - a normal object, but it is an instance of a type we've seen before so it won't be described again. Instead you just get the handle.

    4b: 0x00 0x7e 0x00 0x00 – The handle. This is the ID that is referring to the definition of the List class we saw earlier. Handles start counting from 0x007E0000 (I have no idea why, but the spec says so), and it's the first newHandle-d thing we saw in this stream.

    At any rate, the handle stands in for the whole description, so we get straight to the data, which you can't unpack without knowing the structure. We do know it: it is an int, followed by an object, so, first..

    Data for field 1

    4b: 0x00 0x00 0x00 0x13 - list2.value = 19 (19 dec is 0x13 in hex).

    Data for field 2

    it's null, so, we just see 0x70 for the null ref.

    The call out.writeObject(list1) is now completed. We now see very few bytes that all represent the result of out.writeObject(list2):

    The second object

    It's just:

    5b: 0x71 0x00 0x7e 0x00 0x03

    0x71 is a reference to an object we saw before, and its handle is the next 4 bytes. It's referring to the actual object that variable list2 is pointing at, which we already stored before, and it's the 4th thing that was newHandle-d (after the List class description, the string LList;, and list1 itself), hence, the handle for it is 0x007E0003. You have to remember that java allows circular assignment, for example imagine I wrote list1.next = list1;, and if the storage format of serialization couldn't deal with references, trying to serialize list1 would crash with a stack overflow. Most JSON serializers really do crash on that, but java's does not.
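
    If you did have to carry such references over into JSON, one option is to write a marker instead of repeating an object. The sketch below is only an illustration: the "$id" / "$ref" keys are a convention I'm inventing here, not a JSON standard, and it only walks nested dicts. It also shows Python's own json module refusing the circular case.

```python
import json


def to_jsonable(obj, seen=None):
    """Copy nested dicts, replacing a repeated dict with a {"$ref": n} marker."""
    if seen is None:
        seen = {}
    if not isinstance(obj, dict):
        return obj
    if id(obj) in seen:
        return {"$ref": seen[id(obj)]}
    seen[id(obj)] = len(seen)          # number objects in the order first seen
    out = {"$id": seen[id(obj)]}
    for key, value in obj.items():
        out[key] = to_jsonable(value, seen)
    return out


# The circular example from the text: list1.next = list1
list1 = {"value": 17}
list1["next"] = list1

try:
    json.dumps(list1)
    circular_rejected = False
except ValueError:                      # json detects the cycle and gives up
    circular_rejected = True

text = json.dumps(to_jsonable(list1))
```

    Going back from JSON to the binary format would need the reverse step: resolve every "$ref" before writing.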

    Presumably, in your situation, this circularity business won't bother you.

    Now, to the answer!

    I'm not aware of any library that can read these, and due to java serialization being able to store a wide variety of exotica, it's pretty much impossible to parse anything that has been serialized by a JVM without using that JVM and the classes that were in the classpath at the time it was serialized.

    But, if we assume no such exotica is going to happen (and that seems somewhat fair to do, if indeed the java code you are dealing with is just serializing fairly simple data objects), then.. it's not too hard to write some code that 'JSON-izes' and 'de-JSON-izes' java serially stored data. You'd have to write that yourself.

    But, there's good news here. The serialization format does include the names of fields as well as their types, though, the types are stated in terms of java. Still, that means, if the following things are ALL true:

    • No exotica such as proxies or functionrefs or whatever are serialized, just 'plain' objects of 'plain' classes,
    • These objects store only data written in terms of java primitives (booleans, integers (char/byte/short/int/long), and floats (float/double)), java strings, well known core java types (ArrayList, HashMap, that sort of thing), and instances of classes that adhere to this rule,

    then and only then could you write a converter that can convert java serialization formatted data to JSON and back again without the need of a JVM and without the need of those class definitions in the first place.

    It's not an itch I feel a need to scratch, but sounds like fun. Someone with some skill at crafting parsers for binary data should have little trouble making that in a person-day or 3.
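
    As a starting point, here is a rough sketch of the reading half in Python, covering only what appears in the example above: plain serializable objects, primitives, strings, nulls and back-references. Everything else (arrays, enums, block data, classes with custom writeObject such as ArrayList, class annotations) raises or is simply not handled, and the Parser class, its method names and the "@class" key are my own inventions. Strings are decoded as plain UTF-8, which matches Java's modified UTF-8 for ASCII text.

```python
import json
import struct

TC_NULL, TC_REFERENCE, TC_CLASSDESC, TC_OBJECT = 0x70, 0x71, 0x72, 0x73
TC_STRING, TC_ENDBLOCKDATA = 0x74, 0x78
BASE_WIRE_HANDLE = 0x7E0000
PRIMITIVES = {"B": ">b", "C": ">H", "D": ">d", "F": ">f",
              "I": ">i", "J": ">q", "S": ">h", "Z": ">?"}


class Parser:
    def __init__(self, data):
        self.data, self.pos, self.handles = data, 0, []

    def unpack(self, fmt):
        values = struct.unpack_from(fmt, self.data, self.pos)
        self.pos += struct.calcsize(fmt)
        return values

    def utf(self):
        (length,) = self.unpack(">H")
        raw = self.data[self.pos:self.pos + length]
        self.pos += length
        return raw.decode("utf-8")  # assumption: no exotic characters

    def parse(self):
        if self.unpack(">HH") != (0xACED, 5):
            raise ValueError("not Java serialization data, version 5")
        items = []
        while self.pos < len(self.data):   # one item per writeObject call
            items.append(self.content())
        return items

    def content(self):
        (tag,) = self.unpack(">B")
        if tag == TC_NULL:
            return None
        if tag == TC_REFERENCE:
            (handle,) = self.unpack(">I")
            return self.handles[handle - BASE_WIRE_HANDLE]
        if tag == TC_STRING:
            text = self.utf()
            self.handles.append(text)
            return text
        if tag == TC_CLASSDESC:
            return self.class_desc()
        if tag == TC_OBJECT:
            return self.new_object()
        raise NotImplementedError(f"tag 0x{tag:02x} is not handled here")

    def class_desc(self):
        desc = {"name": self.utf(), "fields": []}
        self.unpack(">q")                  # serialVersionUID, skipped
        self.handles.append(desc)
        flags, count = self.unpack(">BH")
        if flags != 0x02:                  # only plain 'serializable' classes
            raise NotImplementedError(f"class flags 0x{flags:02x}")
        for _ in range(count):
            code = chr(self.unpack(">B")[0])
            name = self.utf()
            if code not in PRIMITIVES:
                self.content()             # the field's type name; not needed
            desc["fields"].append((code, name))
        if self.unpack(">B")[0] != TC_ENDBLOCKDATA:
            raise NotImplementedError("class annotations")
        desc["super"] = self.content()     # None, or the superclass description
        return desc

    def new_object(self):
        desc = self.content()              # fresh description or a reference
        obj = {"@class": desc["name"]}
        self.handles.append(obj)
        chain = []
        while desc is not None:
            chain.append(desc)
            desc = desc["super"]
        for d in reversed(chain):          # superclass values come first
            for code, name in d["fields"]:
                if code in PRIMITIVES:
                    (obj[name],) = self.unpack(PRIMITIVES[code])
                else:
                    obj[name] = self.content()
        return obj


SAMPLE = bytes.fromhex(
    "ac ed 00 05 73 72 00 04 4c 69 73 74 69 c8 8a 15"
    "40 16 ae 68 02 00 02 49 00 05 76 61 6c 75 65 4c"
    "00 04 6e 65 78 74 74 00 06 4c 4c 69 73 74 3b 78"
    "70 00 00 00 11 73 71 00 7e 00 00 00 00 00 13 70"
    "71 00 7e 00 03")
items = Parser(SAMPLE).parse()
print(json.dumps(items, indent=2))
```

    For the example stream this gives two entries: list1 with list2 nested under next, and then list2 again. Note that the sketch throws away exactly the things you'd need for the way back (serialVersionUID, flags, the field type names), so a real round-tripping converter would have to keep those in the JSON as well.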