Search code examples
javastreamguava

Search in large CSV as fast as Guava Splitter


Since Java 8 was released I found out I don't need over 2 MB Google Guava in my projects since I can replace most of it with plain Java. However I really liked nice Splitter API which was both quite fast at the same time. And what is most important - did splitting lazily. It seems to be replaceable with Pattern.splitAsStream. So I prepared quick test - finding a value in the middle of long string (i.e. splitting the whole string does not make sense).

package splitstream;


import com.google.common.base.Splitter;
import org.junit.Assert;
import org.junit.Test;

import java.util.StringTokenizer;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitStreamPerfTest {

    private static final int TIMES = 1000;
    private static final String FIND = "10000";

    @Test
    public void go() throws Exception {
        final String longString = IntStream.rangeClosed(1,20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));

        IntStream.rangeClosed(1,3).forEach((i) -> {
            measureTime("Test " + i + " with regex", () -> doWithRegex(longString));
            measureTime("Test " + i + " with string tokenizer", () -> doWithStringTokenizer(longString));
            measureTime("Test " + i + " with guava", () -> doWithGuava(longString));
        });

    }

    private void measureTime(String name, Runnable r) {
        long s = System.currentTimeMillis();
        r.run();
        long elapsed = System.currentTimeMillis() - s;
        System.out.println("Check " + name +" took " + elapsed + " ms");
    }

    private void doWithStringTokenizer(String longString) {

        String f = null;
        for (int i = 0; i < TIMES; i++) {
            StringTokenizer st = new StringTokenizer(longString,",",false);
            while (st.hasMoreTokens()) {
                String t = st.nextToken().trim();
                if (FIND.equals(t)) {
                    f = t;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }


    private void doWithRegex(String longString) {
        final Pattern pattern = Pattern.compile(",");
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            f = pattern.splitAsStream(longString)
                    .map(String::trim)
                    .filter(FIND::equals)
                    .findFirst().orElse("");
        }
        Assert.assertEquals(FIND, f);
    }


    private void doWithGuava(String longString) {
        final Splitter splitter = Splitter.on(',').trimResults();
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            Iterable<String> iterable = splitter.split(longString);
            for (String s : iterable) {
                if (FIND.equals(s)) {
                    f = s;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }
}

The results are (after a warm-up)

Check Test 3 with regex took 1359 ms
Check Test 3 with string tokenizer took 750 ms
Check Test 3 with guava took 594 ms

How to make the Java implementation as fast as Guava? Maybe I'm doing it wrong?

Or maybe you know any tool/library as fast as Guava Splitter that does not involve pulling tons of unused classes just for this one?


Solution

  • First thing is that guava is so much more than just the Splitter, Predicate and Function - you are probably not using everything it has to offer; we use it hardcore and just hearing that makes me shiver. Anyhow, you tests are broken - in probably numerous ways. I've used JMH to test these two method just for the fun of it:

        @BenchmarkMode(org.openjdk.jmh.annotations.Mode.AverageTime) 
        @OutputTimeUnit(TimeUnit.NANOSECONDS) 
        @Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)   
        @Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS) 
        @State(Scope.Thread) public class GuavaTest {
    
        public static void main(String[] args) throws RunnerException {
            Options opt = new OptionsBuilder().include(GuavaTest.class.getSimpleName())
                    .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                    .shouldFailOnError(true)
                    .build();
            new Runner(opt).run();
        }
    
        @Param(value = { "300", "1000" })
        public String tokenToSearchFor;
    
        @State(Scope.Benchmark)
        public static class ThreadState {
            String longString = IntStream.range(1, 20000).boxed().map(Object::toString).collect(Collectors.joining(" ,"));
    
            StringTokenizer st = null;
    
            Pattern pattern = null;
    
            Splitter splitter = null;
    
            @Setup(Level.Invocation)
            public void setUp() {
                st = new StringTokenizer(longString, ",", false);
                pattern = Pattern.compile(",");
                splitter = Splitter.on(',').trimResults();
            }
        }
    
        @Benchmark
        @Fork(1)
        public boolean doWithStringTokenizer(ThreadState ts) {
            while (ts.st.hasMoreTokens()) {
                String t = ts.st.nextToken().trim();
                if (t.equals(tokenToSearchFor)) {
                    return true;
                }
            }
            return false;
        }
    
        @Benchmark
        @Fork(1)
        public boolean doWithRegex(ThreadState ts) {
            return ts.pattern.splitAsStream(ts.longString)
                    .map(String::trim)
                    .anyMatch(tokenToSearchFor::equals);
        }
    
        @Benchmark
        @Fork(1)
        public boolean doWithGuava(ThreadState ts) {
            Iterable<String> iterable = ts.splitter.split(ts.longString);
            for (String s : iterable) {
                if (s.equals(tokenToSearchFor)) {
                    return true;
                }
            }
            return false;
        }
    
    }
    

    And the results:

    Benchmark                        (tokenToSearchFor)  Mode  Cnt       Score        Error  Units
    GuavaTest.doWithGuava                           300  avgt    5   19284.192 ±  23536.321  ns/op
    GuavaTest.doWithGuava                          1000  avgt    5   67182.531 ±  93242.266  ns/op
    GuavaTest.doWithRegex                           300  avgt    5   65780.954 ± 169044.641  ns/op
    GuavaTest.doWithRegex                          1000  avgt    5  182530.069 ± 409571.222  ns/op
    GuavaTest.doWithStringTokenizer                 300  avgt    5   34111.030 ±  61014.332  ns/op
    GuavaTest.doWithStringTokenizer                1000  avgt    5  118963.048 ± 165510.183  ns/op      
    

    That makes guava the fastest indeed.

    If you add parallel to the splitAsStream then it will become interesting, a must read here