I assume that if I have two identical data frames, the R digest function should return the same hash for both. Consider these two data frames:
library(digest)
library(dplyr)
df1 <- tibble(a = 1:5, b = 11:15)
df2 <- df1 %>%
  mutate(c = b - 1) %>%
  select(-c)
Both data.frames are identical when printed,
> df1
# A tibble: 5 × 2
      a     b
  <int> <int>
1     1    11
2     2    12
3     3    13
4     4    14
5     5    15
or compared:
> df1 == df2
        a    b
[1,] TRUE TRUE
[2,] TRUE TRUE
[3,] TRUE TRUE
[4,] TRUE TRUE
[5,] TRUE TRUE
However, the digest function returns different results:
> digest(df1)
[1] "4f82aa1035792a0acf304242ce6ad3ec"
> digest(df2)
[1] "3b7e697af67e8e36ba9b59aef69db304"
I would expect digest to return the same hash for both. Is there a better way to check whether two data frames are identical?
I have no idea why the digest differs between the two. It is important to note, however, that this does not happen only with dplyr:
df3 <- df1
df3$c <- 1
df3 <- df3[ ,-3]
digest(df3)
returns a third, distinct value:
75f29cee80971220081372627632689f
Though it is interesting to note that the digest of df4 <- df1[,1:2] is the same. I can even generate that same hash from both df1 and df2 with:
digest(df1[,1:2])
digest(df2[,1:2])
and a different (shared) hash of "f111f4b3d65b8bc2569a4b79a821a6d8" with
digest(as.data.frame(df1[,1:2]))
digest(as.data.frame(df2[,1:2]))
It must have something to do with how R creates the object and stores it in memory. My understanding is that digest does not hash only the values held in the object; it hashes the object itself, so differences in how the object is stored (its attributes, for example) can change the hash even when the values match. So, you may need to add a step that builds both objects in the same manner to get the hashes to agree.
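To illustrate the point above, here is a minimal sketch. It does not use the data from the question: `x`, `y`, and the attribute name `note` are made up for illustration. By default, digest serializes the object before hashing it, so metadata such as attributes feeds into the hash even when every value matches.

```r
library(digest)

# Two integer vectors with identical values.
x <- c(1L, 2L, 3L)
y <- x

# Attach a hypothetical attribute to y; the values themselves are untouched.
attr(y, "note") <- "extra metadata"

# The values still compare equal element by element.
all(x == y)

# The hashes differ, because the attribute is serialized along with the values.
digest(x) == digest(y)

# Comparing attributes is one way to look for what differs between two objects.
identical(attributes(x), attributes(y))
```

For the data frames in the question, comparing attributes(df1) with attributes(df2) may show what differs, though I have not verified which attribute is responsible there.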
However, if you are looking for a different way to compare, I suggest all_equal from dplyr:
all_equal(df1, df2)
returns TRUE and allows some handling of edge cases that might be nice (e.g., by default it doesn't care if the rows or columns are rearranged). If your goal is just to check that two datasets agree, this is likely a "better" way than fighting with digest.
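If you would rather not depend on dplyr for the comparison (as far as I know, newer dplyr releases mark all_equal as deprecated), base R has similar checks. The sketch below uses made-up data frames: `df_a`, `df_b`, and the `note` attribute are illustrative assumptions, not the objects from the question. identical is strict and compares attributes, while all.equal can be told to ignore them.

```r
# Two data frames with the same values.
df_a <- data.frame(a = 1:3, b = 4:6)
df_b <- df_a

# Give df_b a hypothetical extra attribute; the columns are unchanged.
attr(df_b, "note") <- "extra metadata"

# Strict comparison: attributes count, so this reports a difference.
identical(df_a, df_b)

# Value-focused comparison: attributes are ignored.
# all.equal returns TRUE or a character description, hence the isTRUE wrapper.
isTRUE(all.equal(df_a, df_b, check.attributes = FALSE))
```

Which of the two is appropriate depends on whether metadata differences should count as a mismatch for your use case.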