Parsing an XML backup of a Plone site

I have the task of parsing a huge backup of a Plone ZODB. There was no other way to get the backup than as an XML file, which is roughly 433 MB in size.

Please don't ask why or how, I just got the task of parsing the file in order to retrieve pictures, files and other important data.

I have written a StAX-based XML parser in Java, and it works so far: I can read the file, store information and print it to a text file if necessary.

Now my problem is: where is the data I need to retrieve? As far as I can tell from reading the XML file (which is difficult even with 16 GB of memory), its nodes all look the same; only the attributes differ from one another (e.g. "id" and "aka" on the record nodes, of which there are more than 40,000).

Is there any Plone or ZODB developer who can point me in the direction of how and where data is stored in such an XML file? What kind of data do I need to feed to my parser to find, store and print the information?

Or is there any other idea on how I can retrieve the data from the XML file?

Please bear in mind, I >>cannot<< use anything else but this Plone.xml as basis. I also won't be able to share the file for obvious reasons of privacy and security.

Solution

The XML format represents ZODB object entries.

The ZODB uses the pickle module as the basis for serialising objects to a sequence of bytes. The XML file format tries to give you separate XML tags for the Python primitive types (numbers, strings, containers), but you still get the 'raw' object data, which can contain a lot of entries that are probably not all that interesting for your task.

In the ZODB, a whole object tree is stored; objects containing other objects containing yet more. To prevent any change in this tree requiring a complete rewrite of the stored data, objects can inherit from a dedicated persistence class that tracks changes to just that object separately, and records then use references to those separate records.

The XML format then contains, at the top-level, <record> elements; these represent separate objects with attributes in the tree, and if these contain other persistent objects, the references between them are encoded as <persistent> elements; looking something like:

<persistent>
  <tuple>
      <string id="[persistentid.subid]" encoding="base64">[base64-encoded-persistentid]</string>
      <global id="[persistentid.subid]" name="[classname]" module="[module for class]"/>
  </tuple>
</persistent>

This then represents a Python tuple with two values: a base64-encoded persistent ID (a record reference) and a Python object reference. The latter can be ignored, as the same information is encoded in the referenced <record> element.

The persistent ID value refers to another record; the simplest way to dereference these is by matching it against the aka attribute of a <record> tag:

<record id="[persistentid]" aka="[base64-encoded-persistentid]">
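
The dereferencing step can be sketched in Python as follows. This is only an illustration: the sample fragment is made up to follow the <persistent> layout described above, and persistent_targets is a hypothetical helper name. References that do not have the tuple/string shape are skipped.

```python
import base64
import struct
import xml.etree.ElementTree as ET

# Hypothetical fragment; the ids and nesting are made up for illustration.
SAMPLE = """
<record id="101" aka="AAAAAAAAAGU=">
  <pickle>
    <global id="101.1" name="Folder" module="OFS.Folder"/>
  </pickle>
  <pickle>
    <tuple>
      <persistent>
        <tuple>
          <string id="101.2" encoding="base64">AAAAAAAALRk=</string>
          <global id="101.3" name="Blob" module="ZODB.blob"/>
        </tuple>
      </persistent>
    </tuple>
  </pickle>
</record>
"""

def persistent_targets(record):
    """Yield (aka, numeric id) for each <persistent> reference in a record."""
    for persistent in record.iter("persistent"):
        ref = persistent.find("tuple/string")
        if ref is None or ref.get("encoding") != "base64":
            continue  # a different reference layout; not handled in this sketch
        aka = (ref.text or "").strip()
        (oid,) = struct.unpack(">Q", base64.b64decode(aka))
        yield aka, oid

record = ET.fromstring(SAMPLE)
targets = list(persistent_targets(record))
print(targets)  # [('AAAAAAAALRk=', 11545)]
```

Each aka value returned this way can then be looked up among the <record> elements, either by the aka attribute or by the numeric id.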

The persistent ID is really an 8-byte big-endian representation of an unsigned long integer; the id attribute represents the same number:

>>> import base64, struct
>>> base64.b64decode('AAAAAAAAAGU=')
b'\x00\x00\x00\x00\x00\x00\x00e'
>>> struct.unpack('>Q', base64.b64decode('AAAAAAAAAGU='))
(101,)
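
If it helps, the conversion can be wrapped in two small helpers that go in both directions. This is a sketch for Python 3; the names aka_to_id and id_to_aka are hypothetical and not part of any ZODB API.

```python
import base64
import struct

def aka_to_id(aka):
    """Decode a base64 'aka' value into the integer used in the 'id' attribute."""
    (oid,) = struct.unpack(">Q", base64.b64decode(aka))
    return oid

def id_to_aka(oid):
    """Encode an integer id as the base64 form of its 8-byte big-endian value."""
    return base64.b64encode(struct.pack(">Q", oid)).decode("ascii")

print(aka_to_id("AAAAAAAAAGU="))  # 101
print(id_to_aka(11545))           # AAAAAAAALRk=
```

The same arithmetic is easy to reproduce in Java with java.util.Base64 and a ByteBuffer, should the lookup need to happen inside the StAX parser.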

Each <record> tag then contains one or two <pickle> tags; the first encodes the object type, the second, if present, the state of the object. Without a second <pickle> tag the object is simply empty:

<record id="[persistentid]" aka="[base64-encoded-persistentid]">
  <pickle>
    <global id="[persistentid].1" name="[classname]" module="[module for class]"/>
  </pickle>
  <pickle>
      <!-- ... -->
  </pickle>
</record>
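
To get an overview of what a large export contains, one option is to stream over the <record> elements and note the class named in the first <pickle> of each. The sketch below uses Python's xml.etree.ElementTree.iterparse so the whole file never has to be held in memory. It makes several assumptions: the <ZopeData> wrapper element and the sample records are made up for illustration, index_record_classes is a hypothetical name, and it only handles records whose first <pickle> holds a plain <global>.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample; the root element name is an assumption.
SAMPLE = b"""<ZopeData>
  <record id="101" aka="AAAAAAAAAGU=">
    <pickle>
      <global id="101.1" name="Folder" module="OFS.Folder"/>
    </pickle>
  </record>
  <record id="11545" aka="AAAAAAAALRk=">
    <pickle>
      <global id="11545.1" name="Blob" module="ZODB.blob"/>
    </pickle>
    <pickle>
      <none/>
    </pickle>
  </record>
</ZopeData>"""

def index_record_classes(source):
    """Map each record's aka to (module, class name, has_state)."""
    classes = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":
            continue
        pickles = elem.findall("pickle")
        glob = pickles[0].find("global") if pickles else None
        if glob is not None:
            has_state = len(pickles) > 1
            classes[elem.get("aka")] = (glob.get("module"), glob.get("name"), has_state)
        elem.clear()  # drop the record's children to keep memory use low
    return classes

classes = index_record_classes(io.BytesIO(SAMPLE))
for aka, info in classes.items():
    print(aka, info)
```

For a real export, pass the file path instead of the BytesIO object. Counting the resulting class names gives a quick picture of which content types are present.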

What type is used for the state depends on the specific class of the pickled object; the default is to take the instance __dict__ and encode that, but specific implementations can opt to implement a custom __getstate__ method (and a corresponding __setstate__). For the BTrees package, for example, you will typically find both key-value pairs and Bucket objects, which are there just to break up a larger BTree into separate records.

Any instances of classes that are not inheriting from the special persistence class (and thus don't get a separate record) are stored as <object> tags with the Python class recorded as <klass> tag, followed by a tuple for the initial object arguments, plus an optional state.

If you are looking for large binary content (images, files) you may be out of luck, as all modern Plone versions use the ZODB BLOB support where such data is stored in separate files. The XML file will merely point to empty persistent records where the ZODB blob contents are then found by other means:

<record id="11545" aka="AAAAAAAALRk=">
  <pickle>
    <global id="11545.1" name="Blob" module="ZODB.blob"/>
  </pickle>
  <pickle>
    <none/>
  </pickle>
</record>

The <none/> tag represents the Python None object (equivalent to null in Java). The blob data is then not included with the export.

Other random notes:

- <reference> tags represent a reference to an object already encoded earlier, but not one that has a separate persistent <record>; these point to [persistentid.subid] values. There is little point in recording the same object more than once, after all.
- The <unicode> tag values are encoded with UTF-8; the encoding attribute will never be set.
- The DateTime.DateTime module has registered a wrapper around an internal copy_reg module function used to handle extension types; you are likely to find entries along the lines of:
```
<object id="5406.12">
  <klass>
    <global id="5406.9" name="_dt_reconstructor" module="DateTime.DateTime"/>
  </klass>
  <tuple>
    <global id="5406.10" name="Splitter" module="Products.CMFPlone.UnicodeSplitter.splitter"/>
    <global id="5406.11" name="object" module="__builtin__"/>
    <none/>
  </tuple>
</object>
```
Here the _dt_reconstructor is used to create a new copy of Products.CMFPlone.UnicodeSplitter.splitter.Splitter instead; it has no other state (there is no <state> tag).
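
A parser can account for this wrapper when working out which class an <object> tag really stands for. The following Python sketch assumes the layout shown above, where the real class is the first <global> in the argument tuple. The effective_class helper is a hypothetical name, and other reconstructor functions are not handled.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<object id="5406.12">
  <klass>
    <global id="5406.9" name="_dt_reconstructor" module="DateTime.DateTime"/>
  </klass>
  <tuple>
    <global id="5406.10" name="Splitter" module="Products.CMFPlone.UnicodeSplitter.splitter"/>
    <global id="5406.11" name="object" module="__builtin__"/>
    <none/>
  </tuple>
</object>
"""

def effective_class(obj):
    """Return the dotted class name an <object> element stands for, or None."""
    klass = obj.find("klass/global")
    if klass is None:
        return None
    if klass.get("name") == "_dt_reconstructor":
        # The wrapper's first argument names the class that is really created.
        first_arg = obj.find("tuple/global")
        if first_arg is not None:
            klass = first_arg
    return klass.get("module") + "." + klass.get("name")

obj = ET.fromstring(SAMPLE)
name = effective_class(obj)
has_state = obj.find("state") is not None
print(name, has_state)
# Products.CMFPlone.UnicodeSplitter.splitter.Splitter False
```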