I am looking for a hash-kind algorithm that does not provide any security but rather a fixed and distinct pattern for a string, in such a way that a near similar string can be correlated using Leveinshtein distance calculation or any distance metric.
Let's say I have two strings "hello/friend/my?" and "hello/friend/my", and I calculate the distance (Levenshtein) without and with hash in python:
>>> import Levenshtein as lev
>>> Str1 = "hello/friend/my?"
>>> Str2 = "hello/friend/my"
>>> Distance = lev.distance(Str1.lower(),Str2.lower()),
>>> print(Distance)
>>> Ratio = lev.ratio(Str1.lower(),Str2.lower())
>>> print(Ratio)
(1,)
0.967741935483871
>>> Str1hash = hash(Str1)
>>> Str2hash = hash(Str2)
>>> Distance = lev.distance(str(Str1hash), str(Str2hash)),
>>> print(Distance)
>>> Ratio = lev.ratio(str(Str1hash), str(Str2hash))
>>> print(Ratio)
(16,)
0.41025641025641024
You can see that the values generated without hash, shows a closer distance (1) and with hash the distance is too far (16).
I would like to find a hash-kind of function or algorithm that returns a closer distance and ratio between similar strings. Any clue?
The solution I wanted is LSH: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
It solves the question, I posed. It's a technique used in Information Retrieval to find duplicates documents or web pages. Thus I can use the same to compare my two strings and get their similarity index.