Search code examples
rubydigest

How can I generate a reliable digest from a hash in Ruby 2.4?


I have a very large table containing 2 billion rows of 50 attributes. Not all are filled out and it's a sparse matrix.

I dislike having to build query off of all of the values, and the indexes are much too large now. I've lost performance.

For my new approach, I want to add a digest column that contains a digest of all of the attributes in a particular row.

There is no security requirement for this hash, so even MD5 would be fine.

Am I better off building a simple string containing representations of all the keys and values together? Or is there a better way?

For example, given hash:

attr_hash = { attribute1: "Please",
              attribute2: nil,
              attribute3: "don't",
              attribute4: nil,
              attribute5: nil,
              attribute6: nil,
              attribute7: "immediately",
              attribute8: "",
              attribute9: "downvote",
              attribute10: "my",
              attribute11: nil,
              attribute12: "question" }

would this be preferable (and I'm sure you'll agree this is beautiful):

attr_str = attr_hash.select{|k,v| v!="" && !v.nil?}.keys.sort.map{|k| "#{k}=#{attr_hash[k]}" }.join("^^")
digest = Digest::MD5.hexdigest(attr_str)

which gives a nice-looking string:

790470349a791b9897afd52a336ab2bb

I can index that column and get very, very fast response times from the database. And I'm unlikely to get many if any collisions from that. And if there's a collision one in 5 or 10 million times, it's fine.

I deeply appreciate any insights.


Solution

  • Lazy way:

    Digest::SHA2.hexdigest(attr_hash.inspect)
    

    Where that pre-supposes your items have identical ordering. If you need to sort the items first:

    Digest::SHA2.hexdigest(attr_hash.to_a.sort_by { |k, _v| k }.inspect)
    

    I'd use JSON.dump(x) instead of x.inspect if I wanted something more portable, like to non-Ruby code-bases.

    I also wouldn't bother stripping out empty values. The hash function doesn't care.