
Why is the digest of a data.frame changed after the use of dplyr in R?


I assumed that for two identical data.frames, the digest function from the R digest package would return the same hash. Consider these two data frames:

library(digest)
library(dplyr)
df1 <- tibble(a = 1:5, b = 11:15)
df2 <- df1 %>%
  mutate(c = b - 1) %>%
  select(-c)

Both data.frames are identical when printed,

> df1
# A tibble: 5 × 2
      a     b
  <int> <int>
1     1    11
2     2    12
3     3    13
4     4    14
5     5    15

or compared:

> df1 == df2
        a    b
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE

However, the digest function returns different results:

> digest(df1)
[1] "4f82aa1035792a0acf304242ce6ad3ec"
> digest(df2)
[1] "3b7e697af67e8e36ba9b59aef69db304"

I expected digest to return the same hash for both. Is there a better way to check that two data.frames are identical?


Solution

  • I don't know exactly why the digests differ between the two. It is important to note, however, that this is not specific to dplyr:

    df3 <- df1
    df3$c <- 1
    df3 <- df3[ ,-3]
    
    digest(df3)
    

    returns a third, distinct hash:

    75f29cee80971220081372627632689f
    

    Interestingly, though, the digest of df4 <- df1[,1:2] is the same. I can even generate that same hash from both df1 and df2 with:

    digest(df1[,1:2])
    digest(df2[,1:2])
    

    and a different (shared) hash of "f111f4b3d65b8bc2569a4b79a821a6d8" with

    digest(as.data.frame(df1[,1:2]))
    digest(as.data.frame(df2[,1:2]))
    

    It probably has to do with how the object is constructed and stored rather than with the values it holds. By default (serialize = TRUE), digest hashes the serialized object, not just the values inside it, so attributes and other details of the object's representation affect the result. You may therefore need to add a step that rebuilds both objects in the same manner to get the hashes to agree.
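
    To illustrate the point about digest hashing the object rather than only its values, here is a minimal sketch on plain integer vectors. It is not a diagnosis of what differs between df1 and df2; the attribute name "note" is made up for the example.

    ```r
    library(digest)

    # Two integer vectors with the same values; y carries one extra attribute.
    x <- c(1L, 2L, 3L)
    y <- x
    attr(y, "note") <- "extra"   # "note" is an arbitrary, made-up attribute name

    same_values <- all(x == y)              # element-wise comparison looks only at the values
    same_object <- identical(x, y)          # identical() also compares attributes
    same_digest <- digest(x) == digest(y)   # digest() hashes the serialized object

    print(c(values = same_values, object = same_object, digest = same_digest))

    # Listing attributes is one way to look for what differs between two objects.
    str(attributes(y))
    ```

    The values agree while identical() and the two digests do not, which is the same pattern as `==` versus digest above. Comparing attributes(df1) with attributes(df2) might be a reasonable first place to look for the difference.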

    However, if you are looking for a different way to compare, I suggest all_equal from dplyr:

    all_equal(df1, df2)
    

    returns TRUE and handles some edge cases that can be useful (e.g., by default it ignores the order of rows and columns). If your goal is just to check whether two datasets match, this is likely a "better" way than fighting with digest.
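
    A base R alternative, sketched here as one option rather than a drop-in replacement, is all.equal. As far as I know, recent dplyr releases deprecate all_equal, so this may be the safer long-term choice. Unlike all_equal, it does not ignore row or column order. The names df_a, df_b and df_c are made up for the example.

    ```r
    # Rebuild the example with base R only: add a column, then drop it again.
    df_a <- data.frame(a = 1:5, b = 11:15)
    df_b <- df_a
    df_b$c <- df_b$b - 1L
    df_b <- df_b[, c("a", "b")]

    # all.equal() returns TRUE or a description of the differences,
    # so wrap it in isTRUE() when a single logical is needed.
    same <- isTRUE(all.equal(df_a, df_b))
    print(same)

    # A data frame with a changed value is reported as not equal.
    df_c <- df_a
    df_c$b[1] <- 0L
    different <- isTRUE(all.equal(df_a, df_c))
    print(different)
    ```

    Calling all.equal(df_a, df_c) without isTRUE() prints a short description of the mismatch, which can help when tracking down where two datasets diverge.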