python algorithm hash checksum

Levenshtein and hash - finding a hash algorithm that preserves correlation (closer distance)


I am looking for a hash-like algorithm that does not need to provide any security, but instead produces a fixed, distinct pattern for a string, such that near-identical strings yield outputs that can be correlated using Levenshtein distance or any other distance metric.

Say I have two strings, "hello/friend/my?" and "hello/friend/my", and I calculate the Levenshtein distance in Python, first without hashing and then with it:

>>> import Levenshtein as lev
>>> Str1 = "hello/friend/my?"
>>> Str2 = "hello/friend/my"
>>> Distance = lev.distance(Str1.lower(), Str2.lower())
>>> print(Distance)
1
>>> Ratio = lev.ratio(Str1.lower(), Str2.lower())
>>> print(Ratio)
0.967741935483871

>>> # Built-in hash(); string hashes are salted per process, so values differ between runs
>>> Str1hash = hash(Str1)
>>> Str2hash = hash(Str2)
>>> Distance = lev.distance(str(Str1hash), str(Str2hash))
>>> print(Distance)
16
>>> Ratio = lev.ratio(str(Str1hash), str(Str2hash))
>>> print(Ratio)
0.41025641025641024

You can see that without hashing the two strings are close (distance 1), whereas their hashes are far apart (distance 16).

I would like to find a hash-like function or algorithm whose outputs keep a small distance and a high ratio for similar strings. Any clue?


Solution

  • The solution I wanted is LSH (locality-sensitive hashing): https://en.wikipedia.org/wiki/Locality-sensitive_hashing

    It answers the question I posed. It is a technique used in information retrieval to find duplicate documents or web pages, so I can use it to compare my two strings and get a similarity index.
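
As a rough illustration of the idea, here is a minimal MinHash sketch, one common LSH-family scheme for near-duplicate detection. It is not taken from the answer above, and the specific choices are assumptions made for this example: the helper names (shingles, minhash_signature, estimated_similarity), the 3-character shingles, the 128 hash functions, and the use of salted hashlib.blake2b as the hash family. Similar strings share most of their shingles, so their signatures agree in most positions, and the fraction of agreeing positions estimates the Jaccard similarity of the shingle sets.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of text (lowercased)."""
    text = text.lower()
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; the seed selects the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_hashes=128, k=3):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, k)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree (estimates Jaccard similarity)."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b, k=3):
    """Exact Jaccard similarity of the two shingle sets, for comparison."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    s1 = "hello/friend/my?"
    s2 = "hello/friend/my"
    sig1 = minhash_signature(s1)
    sig2 = minhash_signature(s2)
    print("exact Jaccard:    ", jaccard(s1, s2))
    print("MinHash estimate: ", estimated_similarity(sig1, sig2))
```

For the two strings from the question, the exact Jaccard similarity of the 3-character shingle sets is 13/14 (about 0.93), and the MinHash estimate should land near that value; more hash functions tighten the estimate at the cost of a longer signature. Unlike hash(), the signature is meant to be compared position by position rather than with Levenshtein distance on its string form.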