
What is contained in the "function workspace" field in a .mat file?


I'm working with .mat files which are saved at the end of a program. The command is save foo.mat so everything is saved. I'm hoping to determine if the program changes by inspecting the .mat files. I see that from run to run, most of the .mat file is the same, but the field labeled __function_workspace__ changes somewhat.

(I am inspecting the .mat files via scipy.io.loadmat -- just loading the files and printing them out as plain text and then comparing the text. I found that save -ascii in Matlab doesn't put string labels on things, so going through Python is roundabout, but I get labels and that's useful.)

I am trying to determine from where these changes originate. Can anyone explain what __function_workspace__ contains? Why would it not be the same from one run of a given program to the next?

The variables I am really interested in are the same, but I worry that I might be overlooking some changes that might come back to bite me. Thanks in advance for any light you can shed on this problem.

EDIT: As I mentioned in a comment, the value of __function_workspace__ is an array of integers. I looked at the elements of the array and it appears that these numbers are ASCII or non-ASCII character codes. I see runs of characters which look like names of variables or functions, so that makes sense. But there are also some characters (non-ASCII) which don't seem to be part of a name, and there are a lot of null (zero) characters too. So aside from seeing names of things in __function_workspace__, I'm not sure what that stuff is exactly.
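To make the name-like runs in that integer array easier to read, here is a minimal sketch in the spirit of the Unix strings tool. The function name printable_runs and the min_len threshold are illustrative choices, and it assumes the array holds byte values (0 to 255), as described above.

```python
def printable_runs(codes, min_len=4):
    """Return runs of printable ASCII characters found in a sequence of byte values."""
    runs, current = [], []
    for code in codes:
        code = int(code)
        if 32 <= code < 127:
            # Printable ASCII: extend the current run.
            current.append(chr(code))
        else:
            # Null or non-ASCII byte: close the run, keep it if long enough.
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs
```

With a dictionary returned by scipy.io.loadmat, something like printable_runs(mat["__function_workspace__"].ravel()) should list the embedded names; everything between them is binary structure rather than text.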

SECOND EDIT: I found that after commenting out calls to plotting functions, the content of __function_workspace__ is the same from one run of the program to the next, so that's great. At this point the only difference from one run to the next is that there is a __header__ field which contains a timestamp for the time at which the .mat file was created, which changes from run to run.

THIRD EDIT: I found an article, http://nbviewer.jupyter.org/gist/mbauman/9121961 "Parsing MAT files with class objects in them", about reverse-engineering the __function_workspace__ field. Thanks to Matt Bauman for this very enlightening article and thanks to @mpaskov for the pointer. It appears that __function_workspace__ is an undocumented catch-all for various stuff, only one part of which is actually a "function workspace".


Solution

    1) Diffing .mat files

    You may want to take a look at DiffPlug. It can do diffs of MAT files and I believe there is a command line interface for it as well.

    2) Contents of __function_workspace__

    SciPy's __function_workspace__ refers to a special variable at the end of a MAT file that contains extra data needed for reference types (e.g. table, string, handle, etc.) and various other stuff that is not covered by the official documentation. The name is misleading as it really refers to the "Subsystem" (briefly mentioned in the official spec as an offset in the header).
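To see the subsystem offset for yourself, here is a minimal sketch that splits the 128-byte Level 5 header into the fields described in the official spec (116 bytes of text, an 8-byte subsystem offset, a 2-byte version and a 2-byte endian indicator). The function name parse_mat5_header and the dictionary keys are my own choices, not part of any library.

```python
import struct

def parse_mat5_header(header):
    """Split the 128-byte header of a Level 5 MAT-file into its documented fields."""
    if len(header) < 128:
        raise ValueError("a Level 5 MAT-file header is 128 bytes long")
    # Bytes 126-127 hold the endian indicator; b"IM" on disk means little-endian.
    indicator = header[126:128]
    if indicator == b"IM":
        order = "<"
    elif indicator == b"MI":
        order = ">"
    else:
        raise ValueError("not a Level 5 MAT-file")
    # Bytes 116-123: subsystem data offset, bytes 124-125: version.
    subsystem_offset, version = struct.unpack(order + "QH", header[116:126])
    return {
        "description": header[:116].decode("latin-1").rstrip(" \x00"),
        "subsystem_offset": subsystem_offset,
        "version": version,
        "byte_order": order,
    }
```

Calling it on the first 128 bytes of a file (open("foo.mat", "rb").read(128)) gives the position of the unnamed element discussed below. As far as I know, an offset field that is all zeros or all spaces means the file has no subsystem data.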

    For example, if you save a reference type, e.g., emptyString = "", the resulting .mat will contain the following two entries:

    (1) The variable itself. It looks sort of like a UInt32 matrix, but is actually an opaque MCOS (MATLAB Class Object System) reference to a string object at some location in the subsystem.

    [0] Compressed (81 bytes, position = 128)
      [0] Matrix (144 bytes, position = 0)
        [0] UInt32[2] = [17, 0] // Opaque
        [1] Int8[11] = ['emptyString'] // Variable Name
        [2] Int8[4] = ['MCOS'] // Object Type
        [3] Int8[6] = ['string'] // Class Name
        [4] Matrix (72 bytes, position = 72)
          [0] UInt32[2] = [13, 0] // UInt32
          [1] Int32[2] = [6, 1] // Dimensions
          [2] Int8[0] = [''] // Variable Name (not needed)
          [3] UInt32[6] = [-587202560, 2, 1, 1, 1, 1] // Data (Reference Target)
    

    (2) A UInt8 matrix without name (SciPy renamed this to __function_workspace__) at the end of the file. Aside from the missing name it looks like a standard matrix, but the data is actually another MAT file (with a reduced header) that contains the real data.

    [1] Compressed (251 bytes, position = 217)
      [0] Matrix (968 bytes, position = 0)
        [0] UInt32[2] = [9, 0] // UInt8
        [1] Int32[2] = [1, 920] // Dimensions
        [2] Int8[0] = [''] // Variable Name
        [3] ... 920 bytes ... // Data (Nested MAT File)
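The top-level layout in the two dumps above can be reproduced with a short sketch. It walks the 8-byte tags (data type, byte count) that follow the header and peeks inside compressed elements with zlib. The function name list_top_level_elements is hypothetical, and the code assumes a little-endian file by default and only handles the top level, not nested sub-elements.

```python
import struct
import zlib

MI_MATRIX = 14      # miMATRIX
MI_COMPRESSED = 15  # miCOMPRESSED

def list_top_level_elements(data, order="<"):
    """Yield (position, type, nbytes, inner) for each top-level data element.

    `data` is the whole file as bytes; `inner` is the (type, nbytes) tag of the
    payload of a compressed element, or None for an uncompressed one.
    """
    pos = 128  # data elements start right after the 128-byte header
    while pos + 8 <= len(data):
        mi_type, nbytes = struct.unpack_from(order + "II", data, pos)
        inner = None
        if mi_type == MI_COMPRESSED:
            # The payload is a zlib stream wrapping a regular element.
            payload = zlib.decompress(data[pos + 8 : pos + 8 + nbytes])
            inner = struct.unpack_from(order + "II", payload, 0)
        yield pos, mi_type, nbytes, inner
        pos += 8 + nbytes  # tag plus payload
```

This matches the positions shown in the dumps: the first element sits at 128 with 81 compressed bytes, so the next one starts at 128 + 8 + 81 = 217.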
    

    The format of the data is unfortunately completely undocumented and somewhat of a mess. I could post the contents of the Subsystem, but it gets somewhat overwhelming even for such a simple case. It's essentially a MAT file that contains a struct that contains a special variable (MCOS FileWrapper__) that contains a cell array with various values, including one that magically encodes various Object Properties.

    Matt Bauman has done some great reverse-engineering work (Parsing MAT files with class objects in them) that I believe all supporting implementations are based on. The MFL Java library contains a full (read-only) implementation of this (see McosFileWrapper.java).

    Some updates to Matt Bauman's post that we found:

    • The MCOS reference can refer to an array of handle objects and may have more than 6 values. It contains sizing information followed by an array of indices (see McosReference.java).
    • The Object Id field looks like a unique id, but the order seems random and sometimes doesn't match. I don't know what this value is, but completely ignoring it seems to work well :)
    • I've seen Segment 5 populated in .fig files, but I haven't been able to narrow down what's in there yet.
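As a rough illustration of the first point, here is a sketch that splits the UInt32 payload of an MCOS reference into its apparent parts. The layout it assumes (magic value 0xDD000000, number of dimensions, the dimensions, one object id per element, then a class id) is reverse-engineered rather than documented, and the function name decode_mcos_reference is made up for this example.

```python
from math import prod

MCOS_MAGIC = 0xDD000000  # shows up as -587202560 when printed as a signed int

def decode_mcos_reference(values):
    """Split the UInt32 payload of an MCOS reference into its apparent parts.

    Assumed layout (reverse-engineered, not documented): magic, ndims,
    the dimensions, one object id per element, then the class id.
    """
    words = [int(v) & 0xFFFFFFFF for v in values]  # undo signed printing
    if words[0] != MCOS_MAGIC:
        raise ValueError("not an MCOS reference")
    ndims = words[1]
    dims = words[2 : 2 + ndims]
    count = prod(dims)  # number of objects in the referenced array
    object_ids = words[2 + ndims : 2 + ndims + count]
    class_id = words[2 + ndims + count]
    return {"dims": dims, "object_ids": object_ids, "class_id": class_id}
```

Applied to the six values from the emptyString example, [-587202560, 2, 1, 1, 1, 1], this reads as a 1x1 array with a single object id 1 and class id 1. An array of handle objects would simply carry more ids between the dimensions and the class id.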

    Edit: FYI, once the string object is correctly parsed and all properties are filled in, the actual string value is encoded in yet another undocumented format (see testDoubleQuoteString).