Comparing the MD5 sum of a string to the contents of a file


I am trying to compare a string (in memory) to the contents of a file to see if they are the same. Boring details on motivation are below the question if anyone cares.

My confusion is that when I hash file contents, I get a different result than when I hash the string.

library(readr)
library(digest)

# write the string to the file
the_string <- "here is some stuff"
the_file <- "fake.txt"
readr::write_lines(the_string, the_file)

# both of these functions (predictably) give the same hash
tools::md5sum(the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398" 

digest(file = the_file)
# "44b0350ee9f822d10f2f9ca7dbe54398"

# now read it back to a string and get something different
back_to_a_string <- readr::read_file(the_file)
# "here is some stuff\n"

digest(back_to_a_string)
# "03ed1c8a2b997277100399bef6f88939"

# add a newline because that's what write_lines did
orig_with_newline <- paste0(the_string, "\n")
# "here is some stuff\n"

digest(orig_with_newline)
# "03ed1c8a2b997277100399bef6f88939"

What I want to do is just digest(orig_with_newline) == digest(file = the_file) to see if they're the same (they are) but that returns FALSE because, as shown, the hashes are different.

Obviously I could either read the file back to a string with read_file or write the string to a temp file, but both of those seem a bit silly and hacky. I guess both of those are actually fine solutions; I really just want to understand why this is happening so that I can better understand how the hashing works.

Boring details on motivation

The situation is that I have a function that will write a string to a file, but if the file already exists then it will error unless the user has explicitly passed .overwrite = TRUE. However, if the file exists, I would like to check whether the string about to be written to the file is in fact the same thing that's already in the file. If this is the case, then I will skip the error (and the write). This code could be called in a loop and it will be obnoxious for the user to continually see this error that they are about to overwrite a file with the same thing that's already in it.


Solution

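  • Short answer: I think you need to set serialize=FALSE. By default, digest() hashes the serialized R object (the output of serialize()) rather than the raw characters of the string, whereas the file= form hashes the bytes in the file, so the two never match. Supposing that the file doesn't contain the extra newline (see below),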

    digest(the_string,serialize=FALSE) ==  digest(file=the_file) ## TRUE
    

    (serialize has no effect on the file= version of the command)
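To see the effect of serialize in isolation, here is a minimal sketch. It assumes the digest package is installed and uses a temporary file and variable names (tmp, hash_raw, and so on) that are only for illustration. The string is ASCII, so encoding does not come into play.

```r
library(digest)

the_string <- "here is some stuff"
tmp <- tempfile(fileext = ".txt")

# cat() writes the characters as they are, with no trailing newline
cat(the_string, file = tmp)

hash_serialized <- digest(the_string)                     # hashes serialize(the_string)
hash_raw        <- digest(the_string, serialize = FALSE)  # hashes the characters themselves
hash_file       <- digest(file = tmp)                     # hashes the bytes in the file
hash_md5sum     <- unname(tools::md5sum(tmp))             # drop the file-name attribute

hash_raw == hash_file          # TRUE
hash_raw == hash_md5sum        # TRUE
hash_serialized == hash_file   # FALSE
```

Only the default (serialized) hash disagrees, because it covers the extra header bytes that serialize() adds around the string.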

    Dealing with newlines

    If you read ?write_lines, it only says

    sep: The line separator ... [information about defaults for different OSes]

    To me, this seems ambiguous as to whether the separator will be added after the last line or not. (You don't expect a "comma-separated list" to end with a comma ...)

    On the other hand, ?base::writeLines is a little more explicit,

    sep: character string. A string to be written to the connection after each line of text.

    If you dig down into the source code of readr you can see that it uses

          output << na << sep;
    

    for each line of text, i.e. the separator is written after every line, including the last one, so it's behaving the same way as writeLines.
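A quick way to check the trailing separator, and to account for it when hashing, might look like the sketch below. It assumes write_lines is called with its default sep of "\n" and that the string is ASCII, so one character is one byte; tmp is an illustrative temporary file.

```r
library(readr)
library(digest)

the_string <- "here is some stuff"
tmp <- tempfile(fileext = ".txt")
readr::write_lines(the_string, tmp)

# The file is one byte longer than the string: the trailing separator
size_matches <- file.size(tmp) == nchar(the_string) + 1

# Hash the string plus the separator, without serializing
hash_matches <- digest(paste0(the_string, "\n"), serialize = FALSE) ==
  digest(file = tmp)

size_matches   # TRUE
hash_matches   # TRUE
```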

    If you really just want to write the string to the file with no added nonsense, I suggest cat():

    identical(the_string, { cat(the_string,file=the_file); readr::read_file(the_file) }) ## TRUE
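For the use case in the question (skip the write when the file already holds the same content), one possible shape is a small helper like the one below. The name same_as_file and its arguments are hypothetical, not part of digest or readr, and the sketch assumes the string passed in already includes whatever trailing newline the writer would add.

```r
library(digest)

# Hypothetical helper: TRUE if `path` exists and its bytes have the same MD5
# as `string`, hashed as raw characters rather than as a serialized R object.
same_as_file <- function(string, path) {
  file.exists(path) &&
    digest(string, algo = "md5", serialize = FALSE) ==
      digest(file = path, algo = "md5")
}

tmp <- tempfile(fileext = ".txt")
res_missing <- same_as_file("here is some stuff", tmp)  # file does not exist yet

cat("here is some stuff", file = tmp)                   # no trailing newline
res_same <- same_as_file("here is some stuff", tmp)
res_diff <- same_as_file("something else", tmp)
```

A caller could then raise the overwrite error only when the file exists and same_as_file() returns FALSE.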