
Julia package DataFrames 1.6.1 doesn't recognize a file serialized with DataFrames 1.3.4


I created a file with Julia's serialize while using DataFrames 1.3.4. The current version of the package is DataFrames 1.6.1, and with it Julia doesn't recognize the file written under DataFrames 1.3.4, so I have to pin DataFrames to the old version. Is this going to be solved?

The file written with the old version is here: https://github.com/andyname/ADiGit/blob/214acca242c5a51bee17cf58d01fc5b9eabc11d4/NAL_IDv3723v3765vSI_WON_IDv20642120v20924959.cs


Solution

  • Serialization in Julia is meant to be a short-term storage format. To quote the documentation of the Serialization standard library:

    The data format can change in minor (1.x) Julia releases, but files written by prior 1.x versions will remain readable. The main exception to this is when the definition of a type in an external package changes. If that occurs, it may be necessary to specify an explicit compatible version of the affected package in your environment. Renaming functions, even private functions, inside packages can also put existing files out of sync. Anonymous functions require special care: because their names are automatically generated, minor code changes can cause them to be renamed. Serializing anonymous functions should be avoided in files intended for long-term storage.

    In some cases, the word size (32- or 64-bit) of the reading and writing machines must match. In rarer cases the OS or architecture must also match, for example when using packages that contain platform-dependent code.

    In short: you should not assume that serialized data can be read back after it moves between different platforms, operating systems, Julia versions, or package versions.
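
    The documentation's advice to specify an explicit compatible version of the affected package can be followed with Pkg. A minimal sketch, assuming a throwaway environment (the temporary directory is an illustrative choice, not something from the question) and network access to the package registry:

    ```julia
    using Pkg

    # Work in a separate environment so the main project is left untouched.
    # A temporary directory is used here purely for illustration.
    Pkg.activate(mktempdir())

    # Install exactly the version that wrote the serialized file ...
    Pkg.add(name = "DataFrames", version = "1.3.4")

    # ... and pin it so that later Pkg.update() calls do not move it.
    Pkg.pin("DataFrames")
    ```

    Whether DataFrames 1.3.4 resolves and precompiles cleanly depends on the Julia version in use.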

    Here is what you can do, assuming that the platform, OS, and Julia version remain the same and only the DataFrames.jl version changes. I also assume that:

    • You use only standard Julia types in your DataFrame (if not, the same problems will be caused by changing versions of the packages that provide these types).
    • You do not use metadata (this is an extra comment for future readers; DataFrames.jl 1.3.4 had no support for metadata).

    Under these conditions the simplest thing to do is:

    1. Install DataFrames.jl 1.3.4 in your project environment.
    2. Deserialize the old data frame, e.g. into a variable df.
    3. Convert this data frame to a NamedTuple by writing nt = Tables.columntable(df).
    4. Serialize nt.
    5. Restart the Julia session and start a new project with DataFrames.jl 1.6.1.
    6. Deserialize the file storing nt.
    7. Re-create the data frame with df = DataFrame(nt).

    This procedure works because we fall back to a standard Julia type (NamedTuple), which will be correctly serialized and deserialized on the same platform, OS, and Julia version.
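
    The conversion steps can be sketched as follows. This is only an illustration under a few assumptions: the small data frame stands in for the one deserialized from the old file, the file name data_nt.jls is made up, Tables.jl is assumed to be installed in the environment, and everything runs in one session, whereas in the real procedure the first half runs under DataFrames.jl 1.3.4 and the second half under 1.6.1.

    ```julia
    using DataFrames, Serialization, Tables

    # --- Old environment (DataFrames.jl 1.3.4) ---
    # Stand-in for steps 1-2; in practice this would be
    # df = deserialize("path/to/old/file").
    df = DataFrame(id = [1, 2, 3], name = ["a", "b", "c"])

    # Step 3: drop down to a plain NamedTuple of column vectors.
    nt = Tables.columntable(df)

    # Step 4: serialize the NamedTuple (hypothetical file name).
    path = joinpath(mktempdir(), "data_nt.jls")
    serialize(path, nt)

    # --- New environment (DataFrames.jl 1.6.1), after restarting Julia ---
    # Step 6: read the NamedTuple back.
    nt_loaded = deserialize(path)

    # Step 7: rebuild the data frame from the NamedTuple.
    df_new = DataFrame(nt_loaded)
    ```

    Because the file now contains only a NamedTuple of vectors, reading it no longer depends on how DataFrames.jl defines its DataFrame type internally.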

    Now, why is this so complex? The reason is that serialization is not part of DataFrames.jl. It is a standard Julia mechanism that is unaware of the DataFrames.jl package.

    For the future, consider storing your data in, e.g., the Arrow format using Arrow.jl. This format is independent of Julia and DataFrames.jl, so it should be stable (and as a bonus you can load and save files in this format from other ecosystems, e.g. Python or R).
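
    A minimal sketch of such a round trip, assuming Arrow.jl is installed alongside DataFrames.jl; the example data and the file name data.arrow are made up for illustration:

    ```julia
    using Arrow, DataFrames

    # Example data; any data frame with Arrow-compatible column types would do.
    df = DataFrame(id = [1, 2, 3], score = [0.5, 1.5, 2.5])

    # Write the data frame to an Arrow file (hypothetical file name).
    path = joinpath(mktempdir(), "data.arrow")
    Arrow.write(path, df)

    # Read it back. Arrow.Table exposes read-only columns, so copycols = true
    # is passed to get ordinary mutable vectors in the resulting data frame.
    df_loaded = DataFrame(Arrow.Table(path); copycols = true)
    ```

    Without copycols = true the columns of the resulting data frame stay backed by the Arrow file and cannot be modified in place.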