Search code examples
javasettreemap

fast way to check if two sets contain at least one same element


I have two TreeMaps and I want to check if they contain at least one identical key (the keys are Strings). So I use two loops for comparison:

boolean found = false;
for(String key1 : map1.keySet()){
    for(String key2 : map2.keySet()){
        if(key1.equals(key2)){
            found = true;
            break;
        }
    }
    if(found){
        break;
    }
}
if(found){
    someFunction(map1, map2);
}

As I have 500,000 TreeMaps (with about 1000 keys each) and I want to check each map against each other map, it takes a long time. Does anyone know a faster solution?

*Edit: I want to call the "someFunction()"-method every time I find two maps with at leat one same key. I think in >90% of all cases found == false


Solution

  • One way you could try is to make a multimap of key->maps, i.e. iterate over all 500k maps and add them for each key they contain.

    Then iterate over the keys again and if there are two or more maps for a key, those maps share it.

    With that approach complexity should drop from O(n² * m) to O(n * m) (n being the number of maps and m being the number of keys).

    Rough outline:

    Multimap<Key, Map<Key, Value>> mapsContainingKey = ... ;//could be a Guava Multimap
    //O(n * m) complexity
    for(Map<Key, Value> m : largeSetOfTreeMaps ) {
      for(Key k : m.keySet() ) {
        mapsContainingKey.put( k, m );
      }
    }
    
    //O(m)
    for( Entry<Key, Map<Key, Value>> entry : mapsContainingKey.entries() ) {
      Key key = entry.getKey();
      Collection<Map<Key, Value>> mapsWithSameKey = entry.getValue();
      if( mapsWithSameKey.size() > 1 ) {
        //all maps in that collection share this key
      }
    }
    

    Update: I ran a quick benchmark and though it is not optimized there's a clear trend:

    The "naive" approach is looping over all maps and checking against all following maps so that each pair is only checked once. Additionally I applied what Holger suggested for comparing two maps.

    The "map" approach is what I posted here.

    Results on my machine for 1000 maps with each having 100 random String keys of length 10:

    naive: 11656 ms
    map:     235 ms
    

    Update 2: Some more results with different sizes:

    1000 maps with 100 keys of varying length (the longer the keys, the less collisions)

    key length   1        2         3         4         5        10        20
    naive      417 ms  3221 ms  10937 ms  11273 ms  11357 ms  11383 ms  11706 ms
    map         16 ms    43 ms     86 ms    224 ms    245 ms    210 ms    154 ms
    

    1000 maps with varying number of keys each and key length 10 (the more keys, the more collisions)

    key count    50       100       500
    naive      4865 ms  11368 ms  81280 ms 
    map          64 ms    206 ms    913 ms
    

    Varying number of maps with 1000 keys each and key length 10 (the more maps, the more collisions)

    map count    500     1000      2000
    naive      6323 ms  12766 ms  47798 ms 
    map         139 ms    206 ms    333 ms
    

    As you can see, the number of maps has the most influence on this followed by the number of keys.