How to make a hash key that would match/compare to the hash of similar text/html?


I would like to make a sort of hash key out of a text (in my case, HTML) that would match/compare to the hash of other, similar texts.

Example of matching texts:

  • "2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
  • "2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
  • ...
  • "2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..

So far I've thought of removing numbers and tags, but that would still leave the random words.

Is there anything out there that does this?

I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.

The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1, hash2) that would return the percentage of matching text.

Any function like compare(text1, text2) would not work in my case, as I have many pages to compare (~20 million at the moment).

Any advice is welcome!


Solution

  • I've never had to do anything quite like this, so just throwing something out there based on general hashing knowledge.

    First, in general, I doubt you can represent the entire string you want to compare as one value hashed therefrom, then meaningfully find approximate matches using just that. Hashing functions are generally designed to produce a huge pseudo-random difference in output value from the tiniest change in input value - so used naively they're not a good match for this problem, but...

    What might work is using some convention for breaking the long text into subsections, such as looking for terminating punctuation (full stop, exclamation mark, question mark) at least N characters apart. You could then hash those individual substrings and use the count of matching hashes to approximate the amount of matching text.
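
    As a very rough sketch (minLen, the terminator set and everything else here are arbitrary choices, not a recommendation), that splitting convention could look something like this in C++:

        #include <cstddef>
        #include <string>
        #include <vector>

        // Split text at terminating punctuation, but only once the current chunk
        // has reached at least minLen characters, so sections stay a sensible size.
        std::vector<std::string> splitIntoChunks(const std::string& text,
                                                 std::size_t minLen = 64) {
            std::vector<std::string> chunks;
            std::string current;
            for (char c : text) {
                current += c;
                if ((c == '.' || c == '!' || c == '?') && current.size() >= minLen) {
                    chunks.push_back(current);
                    current.clear();
                }
            }
            if (!current.empty()) chunks.push_back(current);   // trailing remainder
            return chunks;
        }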

    You'd have to work out a suitable level of granularity to divide the text into a reasonable number of distinct hashes, balancing the size of your hashes and the speed of hash comparisons against the accuracy of matches. You may also want to do some prior transforms, such as converting characters to a single case or replacing each run of one or more whitespace characters with a single space, perhaps replacing punctuation with a space: that way trivial differences won't cause hashes to mismatch. Tune to taste.
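
    For instance, one arbitrary take on those transforms (again just a sketch; what to keep or strip is entirely tune-to-taste):

        #include <cctype>
        #include <string>

        // One possible normalization: lower-case letters, keep the sentence
        // terminators used for splitting, and collapse every other run of
        // whitespace/digits/punctuation into a single space.
        std::string normalize(const std::string& text) {
            std::string out;
            bool lastWasSpace = true;               // also trims leading separators
            for (unsigned char c : text) {
                if (std::isalpha(c)) {
                    out += static_cast<char>(std::tolower(c));
                    lastWasSpace = false;
                } else if (c == '.' || c == '!' || c == '?') {
                    out += static_cast<char>(c);
                    lastWasSpace = false;
                } else if (!lastWasSpace) {
                    out += ' ';
                    lastWasSpace = true;
                }
            }
            return out;
        }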

    In your example:

    "2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..

    Say you break on full stops, or, where there are no full stops, at local minima in sorted word order such that at most 5-20 words appear in a section... you might end up with substrings like:

    • "2012/10/01 This is my webpage #1."
    • "This is the first bit from 100k of text."
    • "This is the second bit from 100k of text."
    • "Yet another bit from the 100k."
    • "chicken book dog crayon stick hug"
      • break due to "apple" being local min
    • "apple twig paper glove bookend ibm"
      • break due to "activation" being local min
    • "activation usurper triad monkey wrench."
      • break on "."
    • "zebra italy quark stew century dinosaur jacket egg trick"
      • break due to "chicken" being local min; "century" is < 5 words in
    • "chicken joke road bad"

    Then you use a normal string hashing function on each of the above. To compare this to other similarly-hashed text, you look for the number of matching hash values (if you don't assign any importance to the order or contiguity of the matching subsections of text, it's pretty efficient to iterate over pre-sorted lists of both sets of hashes, or to prepopulate a hash table with one set of hash values and then look up each of the other set in turn).
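
    Putting it together, a hedged sketch of that hashing and comparison step, reusing the hypothetical normalize()/splitIntoChunks() above. std::hash is used purely for illustration; its results aren't guaranteed to be stable across implementations, so for ~20 million pages stored in a database you'd probably substitute a fixed 64-bit hash such as FNV-1a:

        #include <algorithm>
        #include <cstddef>
        #include <cstdint>
        #include <functional>
        #include <string>
        #include <vector>

        // Hash every normalized chunk; the sorted vector of hashes acts as the
        // page's "similar hash" fingerprint.
        std::vector<std::uint64_t> generateSimilarHash(const std::string& text) {
            std::vector<std::uint64_t> hashes;
            for (const std::string& chunk : splitIntoChunks(normalize(text)))
                hashes.push_back(std::hash<std::string>{}(chunk));
            std::sort(hashes.begin(), hashes.end());
            return hashes;
        }

        // Walk both sorted hash lists and count matching values, ignoring the
        // order/contiguity of sections; report matches as a percentage of the
        // smaller list.
        double compareSimilarHashes(const std::vector<std::uint64_t>& a,
                                    const std::vector<std::uint64_t>& b) {
            if (a.empty() || b.empty()) return 0.0;
            std::size_t i = 0, j = 0, matches = 0;
            while (i < a.size() && j < b.size()) {
                if (a[i] == b[j])     { ++matches; ++i; ++j; }
                else if (a[i] < b[j]) ++i;
                else                  ++j;
            }
            return 100.0 * matches / std::min(a.size(), b.size());
        }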