Search code examples
javaperformancecollections

Efficient way to add elements of an array to HashSet


I have a HashSet. I need to split a String and add the elements of the string to the HashSet. I wanted to know which one is efficient way of doing this.

 Collections.addAll(myHashSet, myString.split(":"));
 myHashSet.addAll(List.of(myString.split(":")));

In 2nd option, I see that we are creating extra object of List to create list out of array. But, we are not creating any extra object in 1st approach. So, I think that 1st approach is more efficient. But, I wanted to check.


Solution

  • tl;dr

    Collections.addAll(myHashSet, myString.split(":")) is best.

    Other options are very close though, even for very large data sets.

    Explanation

    For small elements it really does not matter at all, do not underestimate how strong your computer is.

    For bigger stuff, you have to take care to not make any accidental copies or similar:

    • List.of creates an extra copy of the whole data just to make it list, bad
    • Arrays.asList creates a fake list wrapper around the array, no copies, good
    • Collections.addAll(...) adds directly from the array, good

    Now, there is also splitAsStream, which avoids the array to begin with by directly adding during the splitting. Theoretically, thats the best. But in practice, the streams come with a lot of overhead.

    It is your best option if the data is larger than your RAM, i.e. if you can not afford creating the array. That said, chances are you could not afford holding the String to begin with then.


    Benchmarks

    Instead of making educated guesses, you should actually run JMH benchmarks and get reliable results. So here they are.

    N is the size of the arrays, so how many strings there are.

    The lower the score, the better.

    Small (20)

    Benchmark          (N)  (SEED)  Mode  Cnt   Score    Error  Units
    addSplitAsStream    20     123  avgt    5   0,001 ±  0,001  ms/op
    collectionsAddAll   20     123  avgt    5  ≈ 10⁻³           ms/op
    setAddAllArrays     20     123  avgt    5  ≈ 10⁻³           ms/op
    setAddAllList       20     123  avgt    5  ≈ 10⁻³           ms/op
    

    Medium (200)

    Benchmark          (N)  (SEED)  Mode  Cnt  Score    Error  Units
    addSplitAsStream   200     123  avgt    5  0,007 ±  0,001  ms/op
    collectionsAddAll  200     123  avgt    5  0,004 ±  0,001  ms/op
    setAddAllArrays    200     123  avgt    5  0,004 ±  0,001  ms/op
    setAddAllList      200     123  avgt    5  0,004 ±  0,001  ms/op
    

    Big (10_000) [EDIT]

    As per comments, this benchmark includes two new cases forEachAdd and addMatcherLoop.

    Benchmark          (N)    (SEED)  Mode  Cnt  Score   Error  Units
    addMatcherLoop     10000     123  avgt    5  0,730 ± 0,017  ms/op
    addSplitAsStream   10000     123  avgt    5  0,947 ± 0,015  ms/op
    collectionsAddAll  10000     123  avgt    5  0,327 ± 0,005  ms/op
    forEachAdd         10000     123  avgt    5  0,334 ± 0,018  ms/op
    setAddAllArrays    10000     123  avgt    5  0,329 ± 0,002  ms/op
    setAddAllList      10000     123  avgt    5  0,346 ± 0,011  ms/op
    

    Large (1_000_000) [EDIT]

    As per comments, this benchmark includes two new cases forEachAdd and addMatcherLoop.

    Benchmark          (N)      (SEED)  Mode  Cnt    Score     Error  Units
    addMatcherLoop     1000000     123  avgt    5  184,229 ±  14,699  ms/op
    addSplitAsStream   1000000     123  avgt    5  222,011 ±  55,131  ms/op
    collectionsAddAll  1000000     123  avgt    5  164,837 ± 189,569  ms/op
    forEachAdd         1000000     123  avgt    5  198,915 ±   9,983  ms/op
    setAddAllArrays    1000000     123  avgt    5  114,747 ±  59,868  ms/op
    setAddAllList      1000000     123  avgt    5  115,520 ±   4,620  ms/op
    

    Benchmark code

    Full JMH test. Adjust N for the array sizes.

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;
    import org.openjdk.jmh.runner.Runner;
    import org.openjdk.jmh.runner.RunnerException;
    import org.openjdk.jmh.runner.options.Options;
    import org.openjdk.jmh.runner.options.OptionsBuilder;
    
    import java.util.*;
    import java.util.concurrent.TimeUnit;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;
    
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Benchmark)
    @Fork(value = 1, jvmArgs = {"-Xms2G", "-Xmx2G"})
    @Warmup(iterations = 2)
    @Measurement(iterations = 5)
    public class BenchmarkHashSetAddAll {
    
        @Param("1000000")
        private int N;
        @Param("123")
        private long SEED;
    
        private String DATA;
        private HashSet<String> TARGET;
    
        public static void main(String[] args) throws RunnerException {
            Options options =
                    new OptionsBuilder().include(BenchmarkHashSetAddAll.class.getSimpleName()).build();
    
            new Runner(options).run();
        }
    
        @Setup
        public void setup(Blackhole blackhole) {
            Random rnd = new Random(SEED);
    
            DATA = Stream.generate(rnd::nextInt)
                    .limit(N)
                    .map(String::valueOf)
                    .collect(Collectors.joining(":"));
            TARGET = new HashSet<>();
        }
    
        @Benchmark
        public void collectionsAddAll(Blackhole blackhole) {
            Collections.addAll(TARGET, DATA.split(":"));
            blackhole.consume(TARGET);
        }
    
        @Benchmark
        public void setAddAllList(Blackhole blackhole) {
            TARGET.addAll(List.of(DATA.split(":")));
            blackhole.consume(TARGET);
        }
    
        @Benchmark
        public void setAddAllArrays(Blackhole blackhole) {
            TARGET.addAll(Arrays.asList(DATA.split(":")));
            blackhole.consume(TARGET);
        }
    
        @Benchmark
        public void addSplitAsStream(Blackhole blackhole) {
            Pattern.compile(":").splitAsStream(DATA).forEach(TARGET::add);
            blackhole.consume(TARGET);
        }
    
        @Benchmark
        public void forEachAdd(Blackhole blackhole) {
            for (String element : DATA.split(":")) {
                TARGET.add(element);
            }
            blackhole.consume(TARGET);
        }
    
        @Benchmark
        public void addMatcherLoop(Blackhole blackhole) {
            Matcher matcher = Pattern.compile(":").matcher(DATA);
    
            int current = 0;
            while (matcher.find()) {
                String element = DATA.subSequence(current, matcher.start()).toString();
                current = matcher.end();
                TARGET.add(element);
            }
            blackhole.consume(TARGET);
        }
    }