
O(n) or faster algorithm for sorting a list by levenshtein distance?


Is there an O(n) or faster algorithm for sorting a list by Levenshtein distance? I've looked at some solutions on SO, but all of them invoke traditional sorting. Now, suppose you just sum the bytes of your input: you'll get hash keys that are pretty much sorted by their Levenshtein distance. For example, I took a set of random strings and computed their hashes by byte summation:

[ { hash: 2826, val: 'LKAMFKLFUAHUHAUHAUHAU:ANGONEGANAILFJAL:' },
  { hash: 2829, val: 'LKAMFKLFLFUAHUAHUHAUAHANGONEGANAILFJAL:' },
  { hash: 2845, val: 'LKAMFKLFLFAKAKKAKAfiO:ANGONEGANAILFJAL:' },
  { hash: 3064, val: 'LKAMFKLFKKKlaNflanfiO:ANGONEGANAILFJAL:' },
  { hash: 3092, val: 'LKAMFKLFLFklaNflanfiO:ANGONEGANAILFJAL:' },
  { hash: 3203, val: 'LKAMFKLFLFklaNflanfiRSRSRSRSRRNAILFJAL:' },
  { hash: 3249, val: 'LKNFUU{N{UAFN{NF}FNPNF{FN{APNF{WNFF{NF' },
  { hash: 3843, val: 'ddddddddddaaaaaaaaaddddddddddaaaaaaaaaa' },
  { hash: 3858, val: 'safndjnjcmxn,znv,mnm,n,mvnm,vn,mznv,mvv' },
  { hash: 3934, val: 'nngnangngdsgsangnanwns.mnv.nv.xnjvnsf.,' },
  { hash: 3972, val: 'adadsadadsadadadsadsadadsadsadadsadsada' },
  { hash: 3992, val: 'adsadadadsadasdasdafadfasfdsafsafasfafd' },
  { hash: 4041, val: 'asfdsafasdfsafafasdfasdfafsdfdasfasfasf' },
  { hash: 4047, val: 'kkkkkkkkkkkdddddddddkkkkkkkkkkddddddddd' },
  { hash: 4058, val: 'jfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfj' },
  { hash: 4081, val: 'ioudnjkanfjfhjhfjhakfshfkjhdajhkjafhkjf' },
  { hash: 4082, val: 'ioudnjkanfjfhjhfjhakfshfkjhdakhkjafhkjf' },
  { hash: 4082, val: 'oisdnkbgjkbajkgkbgkjbkklgjklsbkbfkjafas' },
  { hash: 4090, val: 'ioudnjsanfjfhjhfjhakfshfkjhdakhkjafhkjf' },
  { hash: 4099, val: 'asldfjlkcmclmasldkkjflksajflkjaljfljlfa' },
  { hash: 4101, val: 'sidfjlasjflijflijlfjliafjdlifjlijfiljfl' },
  { hash: 4105, val: 'iousnjsanfjfhjhfjhakfshfkjhdakhkjafhkjf' },
  { hash: 4125, val: 'iousnjsanfjfhlhfjuakfshkkjhdakhkjafhkjf' },
  { hash: 4128, val: 'sadnfjnfjnjfnjsdfnjafnjkfnkfnjkansdfjkn' },
  { hash: 4143, val: 'dnsfanfjknasfjklnaskfnkfnklafnjkfnkldsn' },
  { hash: 4150, val: 'dskfoisandginsgnlgn:nglngbtbiybuburubsu' },
  { hash: 4155, val: 'afadfsfsfsdfffsfsfsfsdfsfsfsdfsfsfsfsfs' },
  { hash: 4166, val: 'kjdkljkljkljlkjkljlkjlkjlkjlkjljlkjljlk' },
  { hash: 4211, val: 'jsanjnvjksnfkjsanfuiawngingiuilugniugng' },
  { hash: 4229, val: 'kllnlknlknklnklnlnlknknklnlnlnklnlknlkn' },
  { hash: 4238, val: 'lsniorhgpwoiqutoiuieofnionofnoinfonfioa' },
  { hash: 4349, val: 'iasfioehwoptqpoituopqwtuoquweporuqiorur' },
  { hash: 4374, val: 'ioequroiqwuroiuriouroiuopriuprouqpourrq' },
  { hash: 4377, val: 'iiuouoiuoiuouoiuuououoiuououoiuououoiuo' } ]   

The result is nearly sorted, which means insertion sort could complete the job really fast, since insertion sort runs in O(n + d) time on a list with d inversions.
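For reference, the experiment above can be reproduced with a short sketch. The helper name `byteSum` is mine; it sums UTF-16 code units, which equal the byte values for ASCII input:

```javascript
// Sum of character codes -- the "hash" used in the table above.
// (charCodeAt returns UTF-16 code units; for ASCII these are the byte values.)
function byteSum(str) {
  let sum = 0;
  for (let i = 0; i < str.length; i++) sum += str.charCodeAt(i);
  return sum;
}

// Decorate each string with its checksum, then order by checksum.
// (A small sample of the strings from the table above.)
const strings = [
  'adadsadadsadadadsadsadadsadsadadsadsada',
  'LKAMFKLFUAHUHAUHAUHAU:ANGONEGANAILFJAL:',
  'jfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfjfj',
];
const table = strings
  .map(val => ({ hash: byteSum(val), val }))
  .sort((a, b) => a.hash - b.hash);
// The 'LKAMFKLF...' string sorts first with hash 2826, matching the first row.
```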

If such a crude experiment produces those results, then surely there is some solution that SO's answers are missing. Which could it be?


Solution

  • The discussion below is my long-winded way of saying that your idea (as I understand it) cannot work in the general case. The reason? Because the Levenshtein distance between two strings of length N could be N even though the strings have identical checksums. A reversed string, for example. Furthermore, the checksum difference between two strings with a Levenshtein distance of 1 can be as large as 255 (or 65,535 with 16-bit Unicode code units). With a range like that, "almost sorting," even if you could do it somehow (see below), isn't going to gain you much.

    So you've noticed a correlation between a simple checksum and Levenshtein distance. It's an obvious relationship. If the Levenshtein distance between two strings is small, then those two strings contain mostly the same characters. So computation of the simple checksum will result in very similar values. Sometimes.

    As somebody else pointed out, though, the reverse isn't true. The strings abcdef and fedcba have identical checksums, but their Levenshtein distance is fairly large for such a short string.

    This isn't true only of reversals. Consider, for example, the string 00000000, with checksum 384. The string 0000000~ is at Lev. distance 1 but has checksum 462, while 11111111 is at distance 8 with checksum 392: the nearer string has the farther checksum.
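Both counterexamples are easy to check with a standard dynamic-programming Levenshtein implementation. This is a sketch, not code from the question; `levenshtein` and `byteSum` are my names:

```javascript
// Standard DP Levenshtein distance: O(m*n) time, O(n) space.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      curr[j] = Math.min(
        prev[j] + 1,                                   // deletion
        curr[j - 1] + 1,                               // insertion
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// The simple checksum from the question.
function byteSum(str) {
  let sum = 0;
  for (let i = 0; i < str.length; i++) sum += str.charCodeAt(i);
  return sum;
}

// Reversal: identical checksums (both 597), distance 6.
byteSum('abcdef') === byteSum('fedcba'); // true
levenshtein('abcdef', 'fedcba');         // 6

// Non-reversal: byteSum('00000000') = 384, byteSum('0000000~') = 462,
// byteSum('11111111') = 392 -- yet the distances are 1 and 8 respectively.
levenshtein('00000000', '0000000~');     // 1
levenshtein('00000000', '11111111');     // 8
```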

    I think you'll find in the general case that the relationship between checksum and Lev. distance is ... sometimes coincidental. But let's ignore that particular problem and move on to your hypothesis about the sorting.

    As I understand it (and, truthfully, your question isn't entirely clear on this point), you want to sort a list of strings based on their Levenshtein distance. You don't say distance from what, but I'll assume that somewhere you have a starting string, S, a bunch of other strings [S1, S2, S3, etc.], and you want to sort that list of other strings by Lev. distance from S.

    Your hypothesis appears to be that computing a simple checksum for each string will allow you to do that sort more quickly.

    The problem is that once you've computed the checksums, you have to sort them. And that's going to take O(n log n) time with a traditional comparison sort (and in any case, at least O(n) time if you have a special-purpose sort). And once you've got that supposedly-almost-ordered list, you have to compute the Lev. distances anyway, and then rearrange the list order to reflect the real distances. But what's the point?

    You have to compute the Lev. distances anyway, and you will spend at least O(n) time sorting something. Why go to the extra trouble of computing and sorting checksums when you can more quickly just compute the Lev. distances and sort those?
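The direct route can be sketched as a decorate-sort-undecorate pass: compute each Lev. distance from S exactly once, then run an ordinary comparison sort on the cached integers. The helper name `sortByDistance` is mine:

```javascript
// Standard DP Levenshtein distance: O(m*n) time, O(n) space.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Sort candidates by distance from S: n distance computations,
// then a plain O(n log n) comparison sort on cached integers.
function sortByDistance(S, candidates) {
  return candidates
    .map(val => ({ dist: levenshtein(S, val), val }))
    .sort((a, b) => a.dist - b.dist)
    .map(entry => entry.val);
}

sortByDistance('kitten', ['sitting', 'written', 'mitten', 'kitten']);
// ['kitten', 'mitten', 'written', 'sitting']  (distances 0, 1, 2, 3)
```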