Search code examples
objective-cnsarrayblockduplicates

Optimizing algorithm for matching duplicates


I've written a small utility program that identifies duplicate tracks in iTunes. The actual matching of tracks takes a long time, and I'd like to optimize it. I am storing track data in an NSMutableDictionary that stores individual track data in NSMutableDictionaries keyed by trackID. These individual track dictionaries have at least the following keys:

  • TrackID
  • Name
  • Artist
  • Duration (in milli ####.####)

To determine if any tracks match one another, I must check:

  • If the duration of two tracks are within 5 seconds of each other
  • Name matches
  • Artist matches

The slow way for me to do it is using two for-loops:

-(void)findDuplicateTracks {

    NSArray *allTracks = [tracks allValues];

    BOOL isMatch = NO;

    int numMatches = 0;

    // outer loop

    NSMutableDictionary *track      = nil;
    NSMutableDictionary *otherTrack = nil;

    for (int i = 0; i < [allTracks count]; i++) { 

        track = [allTracks objectAtIndex:i];

        NSDictionary *summary = nil;

        if (![claimedTracks containsObject:track]) {

            NSAutoreleasePool *aPool = [[NSAutoreleasePool alloc] init];

            NSUInteger duration1  = (NSUInteger) [track objectForKey:kTotalTime];
            NSString *nName       = [track objectForKey:knName];
            NSString *nArtist     = [track objectForKey:knArtist];


            // inner loop - no need to check tracks that have
            // already appeared in i

            for (int j = i + 1; j < [allTracks count]; j++) { 

                otherTrack = [allTracks objectAtIndex:j];

                if (![claimedTracks containsObject:otherTrack]) {

                    NSUInteger duration2 = (NSUInteger)[otherTrack objectForKey:kTotalTime];

                    // duration check
                    isMatch = (abs(duration1 - duration2) < kDurationThreshold);

                    // match name
                    if (isMatch) {

                        NSString *onName = [otherTrack objectForKey:knName];

                        isMatch = [nName isEqualToString:onName];
                    }

                    // match artist
                    if (isMatch) {

                        NSString *onArtist = [otherTrack objectForKey:knArtist];

                        isMatch = [nArtist isEqualToString:onArtist];

                    }

                    // save match data
                    if (isMatch) {

                        ++numMatches;

                        // claim both tracks
                        [claimedTracks addObject:track];
                        [claimedTracks addObject:otherTrack];

                        if (![summary isMemberOfClass:[NSDictionary class]]) {

                            [track setObject:[NSNumber numberWithBool:NO] forKey:@"willDelete"];
                            summary = [self dictionarySummaryForTrack:track];

                        }


                        [otherTrack setObject:[NSNumber numberWithBool:NO] forKey:@"willDelete"];                        
                        [[summary objectForKey:kMatches] 
                                            addObject:otherTrack];

                    }
                }
            }

            [aPool drain];
        }
    }
}

This becomes quite slow for large music libraries, and only uses 1 processor. One recommended optimization was to use blocks and process the tracks in batches (of 100 tracks). I tried that. If my code originally took 9 hours to run, it now takes about 2 hours on a quad-core. That's still too slow. But (talking above my pay grade here) perhaps there is a way to store all the values I need in a C structure that "fits on the stack" and then I wouldn't have to fetch the values from slower memory. This seems too low-level for me, but I'm willing to learn if I had an example.

BTW, I profiled this in Instruments and [NSCFSet member:] takes up 86.6% percent of the CPU time.

Then I thought I should extract all the durations into a sorted array so I would not have to look up the duration value in the dictionary. I think that is a good idea, but when I started to implement it, I wondered how to determine the best batch size.

If I have the following durations:

    2 2 3 4 5 6 6 16 17 38 59   Duration
    0 1 2 3 4 5 6  7  8  9 10   Index

Then just by iterating over the array, I know that to find matching tracks of the song at index 0, I only need to compare it against songs up to index 6. That's great, I have my first batch. But now I have to start over at index 1 only to find that it's batch should also stop at index 6 and exclude index 0. I'm assuming I'm wasting a lot of processing cycles here determining what the batch should be/the duration matches. This seemed like a "set" problem, but we didn't do much of that in my Intro to Algorithms class.

My questions are:

1) What is the most efficient way to identify matching tracks? Is it something similar to what's above? Is it using disjoint and [unified] set operations that are slightly above my knowledge level? Is it filtering arrays using NSArray? Is there an online resource that describes this problem and solution?

I am willing to restructure the tracks dictionary in whatever way (datastructure) is most efficient. I had at first thought I needed to perform many lookups by TrackID, but that is no longer the case.

2) Is there a more efficient way to approach this problem? How do you rock stars go from paragraph 1 to an optimized solution?

I have searched for the answer, longer than I care to admit, and found these interesting, but unhelpful answers:

find duplicates

Find all duplicates and missing values in a sorted array

Thanks for any help you can provide, Lance


Solution

  • My first thought is to maintain some sorted collections as indices into your dictionary so you can stop doing an O(n^2) search comparing every track to every other track.

    If you had arrays of TrackIDs sorted by duration then for any track you could do a more efficient O(log n) binary search to find tracks with durations within your 5 second tolerance.

    Even better for artist and name you can store a dictionary keyed on the artist or track name whose values are arrays of TrackIDs. Then you only need a O(1) lookup to get the set of tracks for a particular artist which should allow you to very quickly determine if there are any possible duplicates.

    Finally if you've built that sort of dictionary of titles to TrackIDs then you can go through all of it's keys and only search for duplicates when there are multiple tracks with the same title. Doing further comparisons only when there are multiple tracks with the same title should eliminate a significant percentage of the library and massively reduce your search time (down to O(n) to build the dictionary and another O(n) for a worst case search for duplicates still leaves you at O(n) rather than the O(n^2) you have now).


    If nothing else do that last optimization, the resulting performance increase should be huge for an library without a significant number of duplicates:

    NSMutableArray *possibleDuplicates = [NSMutableArray array];
    NSMutableDictionary *knownTitles = [NSMutableDictionary dictionary];
    for (NSMutableDictionary *track in [tracks allKeys]) {
        if ([knownTitles objectForKey:[track objectForKey:@"title"]] != nil) {
            [possibleDuplicates addObject:track];
        }
        else {
            [knownTitles addObject:[track objectForKey:@"TrackID"] forKey:[track objectForKey:@"title"]];
        }
    }
    //check for duplicates of the tracks in possibleDuplicates only.