r, serialization, openssl, md5, digest

Different MD5 hashes from R libraries - MD5 for serialized objects


I want to calculate an MD5 hash for an R object. This is usually done on the serialized object. I am aware of two different R libraries that can calculate MD5 hashes: the digest library and the openssl library. But these two return different hash values. Here is an example for the openssl library:

test <- 1:100

library(openssl)
md5(serialize(test, connection = NULL))
# returns: md5 23:a8:b3:40:9e:08:a0:3d:30:6e:3d:3d:cb:fe:21:57 

Now the example for the digest library:

library(digest)
digest(test, "md5", serialize = TRUE)
# returns: [1] "83777773fa047247723ad5a255963144"

Why are these hash values different?


Solution

  • Short answer

    digest skips the first 14 bytes of the serialized object before hashing it.

    For example:

    > .t <- serialize(test, connection = NULL)
    > md5(.t[seq(15, length(.t))])
    md5 83:77:77:73:fa:04:72:47:72:3a:d5:a2:55:96:31:44
    

  • Long answer

    The result of serialize(1:100, connection = NULL) depends on the R version that produced it.

    According to the source code of base::serialize, R writes a header at the start of the serialized output, and that header contains integers that encode the R version.
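To make the header visible, here is a minimal sketch that decodes the first 14 bytes of a serialized object. It assumes the binary XDR format with serialization format version 2 (requested explicitly through `version = 2`; format version 3 writes a longer header). The field names and the helper `unpack_version` are illustrative, not names used by R itself.

```r
# Serialize with format version 2 (an assumption; version 3 has a longer header).
bytes <- serialize(1:100, connection = NULL, version = 2)
header <- bytes[1:14]

# Bytes 1-2: format marker, "X\n" for the binary XDR format.
format_marker <- rawToChar(header[1:2])

# Bytes 3-14: three big-endian 4-byte integers.
fields <- readBin(header[3:14], what = "integer", n = 3, size = 4,
                  endian = "big")
names(fields) <- c("format_version", "writer_version", "min_reader_version")

# Illustrative helper: unpack a version stored as
# major * 65536 + minor * 256 + patch.
unpack_version <- function(v) {
  paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep = ".")
}

print(format_marker)
print(fields[["format_version"]])
print(unpack_version(fields[["writer_version"]]))      # R version that wrote the data
print(unpack_version(fields[["min_reader_version"]]))  # oldest R that can read it
```

The writer version field is the part that changes from one R installation to the next, which is why hashing the full byte vector is not stable across R versions.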

    digest::digest skips these header bytes before computing the MD5 hash, so the result is consistent across R versions.
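The skipping can also be reproduced with digest alone, as a rough sketch. It hashes a raw vector directly (`serialize = FALSE`) and compares dropping the header by hand against the `skip` argument. Two assumptions: the header is 14 bytes long, which holds for binary serialization format version 2, and both `serialize()` and `digest()` are pinned to that format version so they produce the same bytes.

```r
library(digest)

test <- 1:100

# Raw bytes of the object, pinned to serialization format version 2 (assumption).
bytes <- serialize(test, connection = NULL, version = 2)

# Hash of all bytes, header included (what hashing the raw vector as-is gives).
full <- digest(bytes, algo = "md5", serialize = FALSE)

# Hash with the 14 header bytes dropped by hand.
trimmed <- digest(bytes[-(1:14)], algo = "md5", serialize = FALSE)

# Same thing, letting digest skip the 14 bytes itself.
skipped <- digest(bytes, algo = "md5", serialize = FALSE, skip = 14)

# digest's usual path: it serializes internally and skips the header.
via_digest <- digest(test, algo = "md5", serializeVersion = 2)

print(identical(trimmed, skipped))
print(identical(full, trimmed))
print(identical(via_digest, trimmed))
```

If the last comparison prints FALSE on a given setup, the likely cause is a different serialization format version or header length, not a different MD5 implementation.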