I have two TreeMaps and I want to check if they contain at least one identical key (the keys are Strings). So I use two loops for comparison:
boolean found = false;
for(String key1 : map1.keySet()){
for(String key2 : map2.keySet()){
if(key1.equals(key2)){
found = true;
break;
}
}
if(found){
break;
}
}
if(found){
someFunction(map1, map2);
}
As I have 500,000 TreeMaps (with about 1000 keys each) and I want to check each map against each other map, it takes a long time. Does anyone know a faster solution?
*Edit: I want to call the "someFunction()"-method every time I find two maps with at leat one same key. I think in >90% of all cases found == false
One way you could try is to make a multimap of key->maps, i.e. iterate over all 500k maps and add them for each key they contain.
Then iterate over the keys again and if there are two or more maps for a key, those maps share it.
With that approach complexity should drop from O(n² * m)
to O(n * m)
(n
being the number of maps and m
being the number of keys).
Rough outline:
Multimap<Key, Map<Key, Value>> mapsContainingKey = ... ;//could be a Guava Multimap
//O(n * m) complexity
for(Map<Key, Value> m : largeSetOfTreeMaps ) {
for(Key k : m.keySet() ) {
mapsContainingKey.put( k, m );
}
}
//O(m)
for( Entry<Key, Map<Key, Value>> entry : mapsContainingKey.entries() ) {
Key key = entry.getKey();
Collection<Map<Key, Value>> mapsWithSameKey = entry.getValue();
if( mapsWithSameKey.size() > 1 ) {
//all maps in that collection share this key
}
}
Update: I ran a quick benchmark and though it is not optimized there's a clear trend:
The "naive" approach is looping over all maps and checking against all following maps so that each pair is only checked once. Additionally I applied what Holger suggested for comparing two maps.
The "map" approach is what I posted here.
Results on my machine for 1000 maps with each having 100 random String keys of length 10:
naive: 11656 ms
map: 235 ms
Update 2: Some more results with different sizes:
1000 maps with 100 keys of varying length (the longer the keys, the less collisions)
key length 1 2 3 4 5 10 20
naive 417 ms 3221 ms 10937 ms 11273 ms 11357 ms 11383 ms 11706 ms
map 16 ms 43 ms 86 ms 224 ms 245 ms 210 ms 154 ms
1000 maps with varying number of keys each and key length 10 (the more keys, the more collisions)
key count 50 100 500
naive 4865 ms 11368 ms 81280 ms
map 64 ms 206 ms 913 ms
Varying number of maps with 1000 keys each and key length 10 (the more maps, the more collisions)
map count 500 1000 2000
naive 6323 ms 12766 ms 47798 ms
map 139 ms 206 ms 333 ms
As you can see, the number of maps has the most influence on this followed by the number of keys.