sql indexing ocr similarity

Is it possible to index a string/UUID-based primary key for fast search by similarity (e.g. noisy UUIDs)?


Here is the concrete case. I have some codes, which I will call UUIDs here, that come from OCR.

Of the, say, 25 characters in each code, a few are misrecognized. Is it possible to "index by similarity" the UUID column in a SQL database?

Would a SELECT ... LIKE statement already perform well, assuming only one character is wrong per UUID and I run 25 queries, one per character position?

[The noisy uuid is not going to be inserted, just SELECTed.]
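
To make the 25-query LIKE idea concrete, here is a minimal sketch in Python with the standard sqlite3 module. The table name `codes`, the column name `uuid`, and both helper functions are hypothetical, chosen only for illustration. It builds one pattern per character position, replacing that character with the single-character wildcard `_`, and combines the patterns into one query. It assumes the codes contain no `%` or `_` characters, and it only covers substitutions, not inserted or dropped characters.

```python
import sqlite3


def single_wildcard_patterns(code):
    """Return one LIKE pattern per position, with that character replaced by '_'."""
    return [code[:i] + "_" + code[i + 1:] for i in range(len(code))]


def find_candidates(conn, noisy):
    """Return stored codes that differ from `noisy` in at most one position.

    Assumes a hypothetical table `codes` with a text column `uuid`.
    """
    patterns = single_wildcard_patterns(noisy)
    # One "uuid LIKE ?" clause per pattern; only placeholders are interpolated.
    where = " OR ".join("uuid LIKE ?" for _ in patterns)
    rows = conn.execute(f"SELECT DISTINCT uuid FROM codes WHERE {where}", patterns)
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE codes (uuid TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO codes VALUES (?)",
        [("ABCDE12345",), ("ZZZZZ99999",)],
    )
    print(find_candidates(conn, "ABXDE12345"))
```

Whether such patterns can use the primary-key index depends on the database engine and its settings. In general, a pattern whose wildcard is the first character cannot be served by an ordinary B-tree index, so at least one of the 25 patterns is likely to scan the table. Note also that LIKE in SQLite is case-insensitive for ASCII letters by default.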


Solution

  • I'm sorry, I don't know if there is a built-in function for this, but what you are describing is matching by Levenshtein distance, an edit-distance metric between two strings. Have a look at these:

    Definition : https://en.wikipedia.org/wiki/Levenshtein_distance#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,considered%20this%20distance%20in%201965.

    Using SQL : https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/#:~:text=Informally%2C%20the%20Levenshtein%20distance%20between,not%20match%20exactly%20the%20fields.
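
For reference, the Levenshtein distance mentioned above can be sketched in a few lines of Python. This is a minimal dynamic-programming version, not the code from either link. The helper `closest_codes` and its `max_distance` parameter are hypothetical names used only for illustration. Note that it compares the noisy code against every known code, so it is a linear scan rather than an index.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def closest_codes(noisy, known, max_distance=2):
    """Return (distance, code) pairs within max_distance, nearest first."""
    scored = ((levenshtein(noisy, code), code) for code in known)
    return sorted(pair for pair in scored if pair[0] <= max_distance)


if __name__ == "__main__":
    known = ["ABCDE12345", "ZZZZZ99999"]
    print(closest_codes("ABXDE12345", known))
```

Unlike the single-wildcard LIKE approach, this also tolerates characters that OCR dropped or inserted, at the cost of computing a distance for every stored code.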