Search code examples
javastreamcollectors

Java Collectors Streaming filter and .toMap when detecting case insensitve duplicates for various types


I have a data structure that collates potential case insensitive naming clashes.

  • caseInsensitiveDuplicates

  • Think of the nested maps as a way of doing a compound key.

  • The integer represents a type of data that may have duplicates,

  • the first String is the uppercase version of the string

  • the set that follows could contain any number of versions.. .. so JEREMY: ['Jeremy', 'jeremy','JEREMY'] etc is plausible data.

The goal is to identify when the Set contains more than one entry. Upper and lowercase versions of data can co-exist, and I have to identify those cases. Hence this data structure.

  • so the data N1 and n1 will be two entries keyed on the uppercase N1, and I am looking to get that back in the results.

There is a call to filter this via Streams:

  • I have to work on EntrySet to keep keys/values together. I know that much.
  • I want to return the same data structure I started out with (the type of caseInsensitiveDuplicates)
  • I know I need to filter on the size exceeding 1.

(My actual code has an enum where Integer is, and a custom class where String is within the Set on the line where it's declared. See code below).

From initial data like so:

1 : N1 : [n1, N1]
1 : N2 : [n2]

My expected result would be a data structure with data like so:

1 : N1 : [n1, N1]

Here is a link to a runnable version of the code

My initial attempt in the filter() part of the Stream was to use:

e -> e.getValue().values().size() > 1

That just returns everything

caseInsensitiveDuplicates.keySet().size(): 1
t1Dups.keySet().size(): 2
k: N1
v: N1
v: n1
k: N2
v: n2
N1 size: 2
N2 size: 1
---
k:N1
v:N1
v:n1
k:N2
v:n2

@Eritrean indicated the latest modification

e -> e.getValue().values().stream().allMatch(set -> set.size() > 1)

along with the syntactic sugar for the Collectors.toMap()

Thanks for that part.

From an attempt at debugging, the code currently prints:

caseInsensitiveDuplicates.keySet().size(): 1
t1Dups.keySet().size(): 2
k: N1
v: N1
v: n1
k: N2
v: n2
N1 size: 2
N2 size: 1
dupsAllTypes keyset was empty

I have posted a brute force non streaming solution here. But still would like to see if this is solvable in a more efficient way with Streams/Filters.


import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import static java.lang.System.out;
public class TestDupNameDataStructureFilter {
    public static void main(String[] args) {
        Map<Integer, Map<String, Set<String>>> caseInsensitiveDuplicates = new HashMap<>();
        Set<String> n1Set = new TreeSet<>();
        n1Set.add("n1");
        n1Set.add("N1");
        Set<String> n2Set = new TreeSet<>();
        n2Set.add("n2");
        Map<String, Set<String>> m1 = new TreeMap<>();
        m1.put("N1", n1Set);
        Map<String, Set<String>> m2 = new TreeMap<>();
        m2.put("N2", n2Set);
        Integer nmt = 1;
        caseInsensitiveDuplicates.put(nmt, m1);
        Map<String, Set<String>> temp = caseInsensitiveDuplicates.get(nmt);
        temp.put("N2", n2Set);
        caseInsensitiveDuplicates.put(nmt, temp);
        out.println("caseInsensitiveDuplicates.keySet().size(): " + caseInsensitiveDuplicates.keySet().size());
        Map<String, Set<String>> t1Dups = caseInsensitiveDuplicates.get(nmt);
        out.println("t1Dups.keySet().size(): " + t1Dups.keySet().size());
        for (String k : t1Dups.keySet()) {
            out.println("k: " + k);
            for (String v : t1Dups.get(k)) {
                out.println("v: " + v);
            }
        }
        out.println("N1 size: " + caseInsensitiveDuplicates.get(nmt).get("N1").size());
        out.println("N2 size: " + caseInsensitiveDuplicates.get(nmt).get("N2").size());
        Map<Integer, Map<String, Set<String>>> dupsAllTypes = caseInsensitiveDuplicates
        .entrySet()
        .stream()
        //.filter( e -> e.getValue().values().size() > 1)
        .filter( e -> e.getValue().values().stream().allMatch(set -> set.size() > 1) )
        .collect( Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue) );
        if (dupsAllTypes == null) {
            out.println("dupsAllTypes was null");
            return;
        } else if (dupsAllTypes.keySet().size() == 0) {
            out.println("dupsAllTypes keyset was empty");
            return;
        }
        Map<String, Set<String>> dups = dupsAllTypes.get(nmt);
        if (dups == null) {
            out.println("dups was null");
            return;
        } else if (dups.keySet().size() == 0) {
            out.println("dups keyset was empty");
            return;
        }
        out.println("---");
        for (String k : dups.keySet()) {
            out.println("k:" + k);
            for (String v : dups.get(k) ) {
                out.println("v:" + v);
            }
        }
    }
}

Solution

  • A Collection API solution can be expressed as simple as

    Map<Integer, Map<String, Set<String>>> dupsAllTypes = new HashMap<>();
    caseInsensitiveDuplicates.forEach((k1,kv) ->
        kv.forEach((k2,v) -> {
            if(v.size() > 1)
                dupsAllTypes.computeIfAbsent(k1, key -> new TreeMap<>()).put(k2, v);
        })
    );
    

    An equivalent Stream API solution would be

    Map<Integer, Map<String, Set<String>>> dupsAllTypes =
        caseInsensitiveDuplicates.entrySet().stream()
            .filter(e -> e.getValue().values().stream().anyMatch(s -> s.size() > 1))
            .collect(Collectors.toMap(Map.Entry::getKey,
                e -> e.getValue().entrySet().stream()
                    .filter(e2 -> e2.getValue().size() > 1)
                    .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a,b) -> { throw new AssertionError(); },
                        TreeMap::new)))
        );
    

    This suffers from the lack of an API for a pair Stream, so we have to use a Stream of Map.Entry instances and insert getKey and getValue calls which isn’t as neat as Map’s forEach method.

    Further, you want a TreeMap result for the inner maps, but there is no Collector accepting a map factory without a merge function. So we have to provide a merge function despite we know that there can’t be duplicate keys, as the source is already a map. Therefore, this solution provides a (a,b) -> { throw new AssertionError(); } merge function.

    This approach requires two filter operations. We could reduce the work by performing the inner map processing first, but this requires temporary storage:

    Map<Integer, Map<String, Set<String>>> dupsAllTypes =
        caseInsensitiveDuplicates.entrySet().stream()
            .map(e -> new AbstractMap.SimpleEntry<>(e.getKey(),
                e.getValue().entrySet().stream()
                    .filter(e2 -> e2.getValue().size() > 1)
                    .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a,b) -> { throw new AssertionError(); },
                        TreeMap::new))))
            .filter(e -> !e.getValue().isEmpty())
            .collect(Collectors.toMap(Map.Entry::getKey,Map.Entry::getValue));
    

    Here, the second filter is a trivial operation.

    Note that for your task, the Stream API solutions are not only more complicated, they also do not provide any performance advantage. Basically, they do the same as the Collection API approach. Due to the described API limitations, they do even a bit more. They could run in parallel, but you’d need really large data sets to get a benefit from parallel processing.


    If the original data is stored in mutable collections and you don’t need the original state afterwards, in other words, when you are allowed to modify the original collections, you can do the operation in place:

    caseInsensitiveDuplicates.values()
        .removeIf(map -> map.values().removeIf(set -> set.size() <= 1) && map.isEmpty());
    

    This removes all sets not having at least two elements and if this leaves an empty inner map as result, this inner map is removed as well.

    As a corner case, it would not remove an inner map if it was empty to begin with. If such maps shall be removed too, you have to use something like

    caseInsensitiveDuplicates.values()
        .removeIf(map -> map.isEmpty() ||
            map.values().removeIf(set -> set.size() <= 1) && map.isEmpty());