Search code examples
javasortingjava-streamgroupingby

sorting Lists after groupingBy


I am wondering, if there is already an implemented feature in streams (or Collectors) which has sorted Lists as values. E.g. the following codes both produce gender-grouped Lists of persons, sorted by age. The first solution has some overhead sorting (and looks a little bit scruffy). The second solution needs to look at every person twice but does the job in a pretty way.

First sorting then grouping in one stream:

Map<Gender, List<Person>> sortedListsByGender = (List<Person>) roster
        .stream()
        .sorted(Person::compareByAge)
        .collect(Collectors.groupingBy(Person::getGender));

First grouping, then sorting every value:

Map<Gender, List<Person>> sortedListsByGender = (List<Person>) roster
        .stream()
        .collect(Collectors.groupingBy(Person::getGender));
sortedListsByGender.values()
        .forEach(list -> Collections.sort(list, Person::compareByAge));

I am just wondering, if there is already something implemented, which does this in one run, like groupingBySorted.


Solution

  • When using sorted(comparator) on the stream before the collect operation, the stream has to buffer the entire stream contents to be able to sort it and the sorting may involve much more data movement within that buffer, compared to sorting the smaller lists of the groups afterwards. So the performance is not as good as sorting the individual groups though the implementation will utilize multiple cores if parallel processing has been enabled.

    But note that using sortedListsByGender.values().forEach(…) is not a parallelizable operation and even using sortedListsByGender.values().parallelStream().forEach(…) would only allow parallel processing of groups while each sort operation still is sequential.

    When performing the sort operation within a collector as in

    static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
        return Collectors.collectingAndThen(
            Collectors.toCollection(ArrayList::new), l->{ l.sort(c); return l; } );
    }
    

     

    Map<Gender, List<Person>> sortedListsByGender = roster.stream()
        .collect(Collectors.groupingBy(Person::getGender, toSortedList(Person::compareByAge)));
    

    the sort operation behaves the same (thanks to Tagir Valeev for correcting me), but you can easily check how a sort-on-insertion strategy performs. Just change the collector implementation to:

    static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
        return Collectors.collectingAndThen(
            Collectors.toCollection(()->new TreeSet<>(c)), ArrayList::new);
    }
    

    For completeness, if you want a collector which inserts sorted into an ArrayList in the first place to avoid the final copy step, you can use a more elaborated collector like this:

    static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
        return Collector.of(ArrayList::new,
            (l,t) -> {
                int ix=Collections.binarySearch(l, t, c);
                l.add(ix<0? ~ix: ix, t);
            },
            (list1,list2) -> {
                final int s1=list1.size();
                if(list1.isEmpty()) return list2;
                if(!list2.isEmpty()) {
                    list1.addAll(list2);
                    if(c.compare(list1.get(s1-1), list2.get(0))>0)
                        list1.sort(c);
                }
                return list1;
            });
    }
    

    It’s efficient for the sequential usage but its merge function is not optimal. The underlying sort algorithm will benefit from presorted ranges but has to find these ranges first despite our merge function actually knows these ranges. Unfortunately, there’s no public API in the JRE allowing us to utilize these information (efficiently; we can pass subLists to binarySearch but creating a new sub list for each element of list2 may turn out to be too expensive). If we want to raise the performance of the parallel execution further, we have to re-implement the merge part of the sorting algorithm:

    static <T> Collector<T,?,List<T>> toSortedList(Comparator<? super T> c) {
        return Collector.of(ArrayList::new,
            (l,t) -> l.add(insertPos(l, 0, l.size(), t, c), t),
            (list1,list2) -> merge(list1, list2, c));
    }
    static <T> List<T> merge(List<T> list1, List<T> list2, Comparator<? super T> c) {
        if(list1.isEmpty()) return list2;
        for(int ix1=0, ix2=0, num1=list1.size(), num2=list2.size(); ix2<num2; ix2++, num1++) {
            final T element = list2.get(ix2);
            ix1=insertPos(list1, ix1, num1, element, c);
            list1.add(ix1, element);
            if(ix1==num1) {
                while(++ix2<num2) list1.add(list2.get(ix2));
                return list1;
            }
        }
        return list1;
    }
    static <T> int insertPos(
        List<? extends T> list, int low, int high, T t, Comparator<? super T> c) {
        high--;
        while(low <= high) {
            int mid = (low+high)>>>1, cmp = c.compare(list.get(mid), t);
            if(cmp < 0) low = mid + 1;
            else if(cmp > 0) high = mid - 1;
            else {
                mid++;
                while(mid<=high && c.compare(list.get(mid), t)==0) mid++;
                return mid;
            }
        }
        return low;
    }
    

    Note that this last solution, unlike the simple binarySearch based insertion, is a stable sort implementation, i.e. in your case, Persons with the same age and Gender won’t change their relative order, if the source stream has a defined encounter order.