I have postgreSQL 9.2
My task is to find similar names in table (limited by some levenshtain distance).
For example, the distance is 3, the table has data:
| name |
|***************************|
| Marcus Miller |
| Marcos Miller |
| Macus Miler |
| David Bowie |
| Dave Grohl |
| Dav Grol |
| ... |
The result I want to get is like this:
| Marcus Miller, Marcos Miller, Macus Miler |
| Dave Grohl, Dav Grol |
| ... |
Or
| Marcus Miller, Marcos Miller |
| Marcus Miller, Macus Miler |
| Dave Grohl, Dav Grol |
| ... |
I tried this:
SELECT a.name, b.name
FROM my_table a
JOIN my_table b ON b.id < a.id AND levenshtein(b.name, a.name) < 3;
But it is too slow with my datum.
There is a significant conceptual error with your question; GROUP BY
takes certain equivalence relations (in the mathematical sense) as an argument and uses that to partition a SQL relation into equivalence classes.
The problem is that the relation you describe, namely, "are two strings within a certain edit distance of each other", is not an equivalence relation. It's symmetric and reflexive, but not transitive. To illustrate, what should the answer be if I added a series of names to your dataset that morphed "Marcus Miller" into "Dave Grohl", with each name in the series being within that edit distance from the previous?
However, there are algorithms to partition a data set using things that aren't equivalence relations, such as geometrical distance. K-means clustering is one of the best known examples. Perhaps there is a way to adapt k-means or something similar to this problem, I don't know.