I have a very large table containing 2 billion rows with 50 attributes each. Not all of them are filled in, so it's effectively a sparse matrix.
I dislike having to build queries across all of the values, the indexes have grown far too large, and I've lost performance.
For my new approach, I want to add a digest column that contains a digest of all of the attributes in a particular row.
There is no security requirement for this hash, so even MD5 would be fine.
Am I better off building a simple string containing representations of all the keys and values together? Or is there a better way?
For example, given this hash:
attr_hash = { attribute1: "Please",
attribute2: nil,
attribute3: "don't",
attribute4: nil,
attribute5: nil,
attribute6: nil,
attribute7: "immediately",
attribute8: "",
attribute9: "downvote",
attribute10: "my",
attribute11: nil,
attribute12: "question" }
would this be preferable (and I'm sure you'll agree this is beautiful):
require 'digest'

attr_str = attr_hash.select { |k, v| v != "" && !v.nil? }.keys.sort.map { |k| "#{k}=#{attr_hash[k]}" }.join("^^")
digest = Digest::MD5.hexdigest(attr_str)
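For the sample hash above, that one-liner produces this intermediate string before hashing (symbols sort lexicographically, so :attribute10 lands before :attribute3):
attr_str #=> "attribute1=Please^^attribute10=my^^attribute12=question^^attribute3=don't^^attribute7=immediately^^attribute9=downvote"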
Hashing that gives a nice, compact digest:
790470349a791b9897afd52a336ab2bb
I can index that column and get very fast response times from the database, and I'm unlikely to get many, if any, collisions. Even if a collision turned up once every 5 or 10 million rows, that would be fine.
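For a sense of scale, here's a quick birthday-bound sketch (assuming the digest behaves like a random function over its 128-bit output):
# Expected colliding pairs among n random b-bit digests: ~ n * (n - 1) / 2^(b + 1)
n = 2_000_000_000
n * (n - 1) / 2.0 / 2.0**128
#=> ~5.9e-21, effectively zero for 2 billion rows with a 128-bit digest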
I deeply appreciate any insights.
Lazy way:
Digest::SHA2.hexdigest(attr_hash.inspect)
That presupposes your items are always in identical order. If you need to sort the items first:
Digest::SHA2.hexdigest(attr_hash.to_a.sort_by { |k, _v| k }.inspect)
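To see why the sort matters: Ruby hashes preserve insertion order, so two hashes that compare equal can still inspect differently, which would produce different digests:
a = { x: 1, y: 2 }
b = { y: 2, x: 1 }
a == b                  #=> true
a.inspect == b.inspect  #=> false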
I'd use JSON.dump(x) instead of x.inspect if I wanted something more portable, e.g. to non-Ruby codebases.
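For instance (the exact inspect format varies across Ruby versions; the JSON form doesn't):
require 'json'

{ a: 1, b: nil }.inspect  #=> "{a: 1, b: nil}" on Ruby 3.4+, "{:a=>1, :b=>nil}" earlier
JSON.dump(a: 1, b: nil)   #=> '{"a":1,"b":null}', parseable from any language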
I also wouldn't bother stripping out empty values. The hash function doesn't care.
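Putting those pieces together, a minimal sketch (row_digest is a hypothetical name, not an established API; note it keeps the nils and sorts by key):
require 'digest'
require 'json'

# Stable, portable digest of a row's attributes: sort the pairs so
# insertion order doesn't matter, serialize as JSON, then hash.
def row_digest(attrs)
  Digest::SHA2.hexdigest(JSON.dump(attrs.sort_by { |k, _| k.to_s }))
end

row_digest(attribute1: "Please", attribute2: nil) ==
  row_digest(attribute2: nil, attribute1: "Please")
#=> true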