objective-c string algorithm levenshtein-distance

Comparing Multiple Word names with Levenshtein Distances

I'm comparing building names on my campus with input from various databases. People entered these names, and everyone uses their own abbreviation scheme. I'm trying to find the best match from a user input to a canonical form of the name.

I've implemented a recursive Levenshtein Distance method, but there are a few edge cases I'm trying to tackle. My implementation is on GitHub.

Some of the building names are one word, while others are two. A single word on a single word produces fairly accurate results, but there are two things that I need to keep in mind.

Abbreviations: Assuming an input is a shortened version of a name, I can sometimes get the same Levenshtein Distance between the input and an arbitrary name, as well as the correct name. For example, if my input is "Ing" and the building names^1. are ["Boylan", "Ingersoll", "Whitman", "Whitehead", "Roosevelt", and "Library"], I end up with a LD of 6 for both Boylan and Ingersoll. The desired result is here Ingersoll.
Multiword Strings: The second edge cases is when the input and/or output is two words, separated by a space. For example, New Ing is an abbreviation for New Ingersoll. In this case, New Ingersoll and Boylan both score a Levenshtein Distance of 6. If I were to split the strings, New matches New perfectly, and then I just have to refer back to the solution to my previous edge case.

What's the best way to handle these two edge cases?

_{1. For the curious, these are the buildings at Brooklyn College, in New York City.}

Solution

I think you should use the length of the Longest Common Subsequence instead of the Levenshtein Distance. That seems to be a better metric for your case. In essence, it prioritizes insertions and deletions over substitutions, as I suggested in my comment.

It clearly distiguishes between "Ing" -> "Ingersoll" and "Ing" -> "Boylan" (scores of 3 and 1) handles spaces without a problem ("New Ing" -> "New Ingersoll" scores 7 where "New Ing" -> "Boylan" again scores 1), and will also work nicely should you come across an abbreviation like "Ingsl".

The algorithm is straightforward. Where your two strings have length m and n, compare successive prefixes of the strings characterwise (starting with the empty prefixes), keeping scores in a matrix of size m+1, n+1. If a particular pair matches, add one to the score of the previous two prefixes (one row up and one column left in the matrix); otherwise keep the highest of the two scores of those prefixes (compare the entry immediately above and the entry immediately left and take the best). When you've gone through both strings, the last entry in the score matrix is the length of the LCS.

Example score matrix for "Ingsll" and "Ingersoll":

      0 1 2 3 4 5 6
        I n g s l l
    ---------------
0   | 0 0 0 0 0 0 0
1 I | 0 1 1 1 1 1 1
2 n | 0 1 2 2 2 2 2
3 g | 0 1 2 3 3 3 3
4 e | 0 1 2 3 3 3 3
5 r | 0 1 2 3 3 3 3
6 s | 0 1 2 3 4 4 4
7 o | 0 1 2 3 4 4 4
8 l | 0 1 2 3 4 5 5
9 l | 0 1 2 3 4 5 6

Here's an ObjC implementation of the length. Most of the complexity here is just due to wanting to handle composed character sequences -- @"o̶" for example -- correctly.

#import <Foundation/Foundation.h>

@interface NSString (WSSComposedLength)

- (NSUInteger)WSSComposedLength;

@end

@implementation NSString (WSSComposedLength)

- (NSUInteger)WSSComposedLength
{
    __block NSUInteger length = 0;
    [self enumerateSubstringsInRange:(NSRange){0, [self length]}
                             options:NSStringEnumerationByComposedCharacterSequences | NSStringEnumerationSubstringNotRequired
                          usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
                              length++;
                          }];

    return length;
}

@end


@interface NSString (WSSLongestCommonSubsequence)

- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target;
- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target;

@end

@implementation NSString (WSSLongestCommonSubsequence)

- (NSUInteger)WSSLengthOfLongestCommonSubsequenceWithString:(NSString *)target
{
    NSUInteger * const * scores;
    scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];

    return scores[[target WSSComposedLength]][[self WSSComposedLength]];
}

- (NSString *)WSSLongestCommonSubsequenceWithString:(NSString *)target
{
    NSUInteger * const * scores;
    scores = [[self scoreMatrixForLongestCommonSubsequenceWithString:target] bytes];

    //FIXME: Implement this.

    return nil;
}

- (NSData *)scoreMatrixForLongestCommonSubsequenceWithString:(NSString *)target{

    NSUInteger selfLength = [self WSSComposedLength];
    NSUInteger targetLength = [target WSSComposedLength];
    NSMutableData * scoresData = [NSMutableData dataWithLength:(targetLength + 1) * sizeof(NSUInteger *)];
    NSUInteger ** scores = [scoresData mutableBytes];

    for( NSUInteger i = 0; i <= targetLength; i++ ){
        scores[i] = [[NSMutableData dataWithLength:(selfLength + 1) * sizeof(NSUInteger)] mutableBytes];
    }

    /* Ranges in the enumeration Block are the same measure as
     * -[NSString length] -- i.e., 16-bit code units -- as opposed to
     * _composed_ length, which counts code points. Thus:
     *
     * Enumeration will miss the last character if composed length is used
     * as the range and there's a substring range with length greater than one.
     */
    NSRange selfFullRange = (NSRange){0, [self length]};
    NSRange targetFullRange = (NSRange){0, [target length]};
    /* Have to keep track of these indexes by hand, rather than using the
     * Block's substringRange.location because, e.g., @"o̶", will have
     * range {x, 2}, and the next substring will be {x+2, l}.
     */
    __block NSUInteger col = 0;
    __block NSUInteger row = 0;
    [target enumerateSubstringsInRange:targetFullRange
                             options:NSStringEnumerationByComposedCharacterSequences
                          usingBlock:^(NSString * targetSubstring,
                                       NSRange targetSubstringRange,
                                       NSRange _, BOOL * _0)
        {
            row++;
            col = 0;

            [self enumerateSubstringsInRange:selfFullRange
                                     options:NSStringEnumerationByComposedCharacterSequences
                                  usingBlock:^(NSString * selfSubstring,
                                               NSRange selfSubstringRange,
                                               NSRange _, BOOL * _0)
                {
                    col++;
                    NSUInteger newScore;
                    if( [selfSubstring isEqualToString:targetSubstring] ){

                        newScore = 1 + scores[row - 1][col - 1];
                    }
                    else {
                        NSUInteger upperScore = scores[row - 1][col];
                        NSUInteger leftScore = scores[row][col - 1];
                        newScore = MAX(upperScore, leftScore);
                    }

                    scores[row][col] = newScore;
                }];
        }];

    return scoresData;
}

@end

int main(int argc, const char * argv[])
{

    @autoreleasepool {

        NSArray * testItems = @[@{@"source" : @"Ingso̶ll",
                                  @"targets": @[
                                    @{@"string"   : @"Ingersoll",
                                      @"score"    : @6,
                                      @"sequence" : @"Ingsll"},
                                    @{@"string"   : @"Boylan",
                                      @"score"    : @1,
                                      @"sequence" : @"n"},
                                    @{@"string"   : @"New Ingersoll",
                                      @"score"    : @6,
                                      @"sequence" : @"Ingsll"}]},
                                @{@"source" : @"Ing",
                                  @"targets": @[
                                         @{@"string"   : @"Ingersoll",
                                           @"score"    : @3,
                                           @"sequence" : @"Ing"},
                                         @{@"string"   : @"Boylan",
                                           @"score"    : @1,
                                           @"sequence" : @"n"},
                                         @{@"string"   : @"New Ingersoll",
                                           @"score"    : @3,
                                           @"sequence" : @"Ing"}]},
                                @{@"source" : @"New Ing",
                                  @"targets": @[
                                         @{@"string"   : @"Ingersoll",
                                           @"score"    : @3,
                                           @"sequence" : @"Ing"},
                                         @{@"string"   : @"Boylan",
                                           @"score"    : @1,
                                           @"sequence" : @"n"},
                                         @{@"string"   : @"New Ingersoll",
                                           @"score"    : @7,
                                           @"sequence" : @"New Ing"}]}];

        for( NSDictionary * item in testItems ){
            NSString * source = item[@"source"];
            for( NSDictionary * target in item[@"targets"] ){
                NSString * targetString = target[@"string"];
                NSCAssert([target[@"score"] integerValue] ==
                           [source WSSLengthOfLongestCommonSubsequenceWithString:targetString],
                          @"");
//                NSCAssert([target[@"sequence"] isEqualToString:
//                           [source longestCommonSubsequenceWithString:targetString]],
//                          @"");
            }
        }


    }
    return 0;
}