Search in large CSV as fast as Guava Splitter

Since Java 8 was released I found out I don't need over 2 MB Google Guava in my projects since I can replace most of it with plain Java. However I really liked nice Splitter API which was both quite fast at the same time. And what is most important - did splitting lazily. It seems to be replaceable with Pattern.splitAsStream. So I prepared quick test - finding a value in the middle of long string (i.e. splitting the whole string does not make sense).

package splitstream;


import com.google.common.base.Splitter;
import org.junit.Assert;
import org.junit.Test;

import java.util.StringTokenizer;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitStreamPerfTest {

    private static final int TIMES = 1000;
    private static final String FIND = "10000";

    @Test
    public void go() throws Exception {
        final String longString = IntStream.rangeClosed(1,20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));

        IntStream.rangeClosed(1,3).forEach((i) -> {
            measureTime("Test " + i + " with regex", () -> doWithRegex(longString));
            measureTime("Test " + i + " with string tokenizer", () -> doWithStringTokenizer(longString));
            measureTime("Test " + i + " with guava", () -> doWithGuava(longString));
        });

    }

    private void measureTime(String name, Runnable r) {
        long s = System.currentTimeMillis();
        r.run();
        long elapsed = System.currentTimeMillis() - s;
        System.out.println("Check " + name +" took " + elapsed + " ms");
    }

    private void doWithStringTokenizer(String longString) {

        String f = null;
        for (int i = 0; i < TIMES; i++) {
            StringTokenizer st = new StringTokenizer(longString,",",false);
            while (st.hasMoreTokens()) {
                String t = st.nextToken().trim();
                if (FIND.equals(t)) {
                    f = t;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }


    private void doWithRegex(String longString) {
        final Pattern pattern = Pattern.compile(",");
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            f = pattern.splitAsStream(longString)
                    .map(String::trim)
                    .filter(FIND::equals)
                    .findFirst().orElse("");
        }
        Assert.assertEquals(FIND, f);
    }


    private void doWithGuava(String longString) {
        final Splitter splitter = Splitter.on(',').trimResults();
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            Iterable<String> iterable = splitter.split(longString);
            for (String s : iterable) {
                if (FIND.equals(s)) {
                    f = s;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }
}

The results are (after a warm-up)

Check Test 3 with regex took 1359 ms
Check Test 3 with string tokenizer took 750 ms
Check Test 3 with guava took 594 ms

How to make the Java implementation as fast as Guava? Maybe I'm doing it wrong?

Or maybe you know any tool/library as fast as Guava Splitter that does not involve pulling tons of unused classes just for this one?

Solution

First thing is that guava is so much more than just the Splitter, Predicate and Function - you are probably not using everything it has to offer; we use it hardcore and just hearing that makes me shiver. Anyhow, you tests are broken - in probably numerous ways. I've used JMH to test these two method just for the fun of it:

    @BenchmarkMode(org.openjdk.jmh.annotations.Mode.AverageTime) 
    @OutputTimeUnit(TimeUnit.NANOSECONDS) 
    @Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)   
    @Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS) 
    @State(Scope.Thread) public class GuavaTest {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(GuavaTest.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    @Param(value = { "300", "1000" })
    public String tokenToSearchFor;

    @State(Scope.Benchmark)
    public static class ThreadState {
        String longString = IntStream.range(1, 20000).boxed().map(Object::toString).collect(Collectors.joining(" ,"));

        StringTokenizer st = null;

        Pattern pattern = null;

        Splitter splitter = null;

        @Setup(Level.Invocation)
        public void setUp() {
            st = new StringTokenizer(longString, ",", false);
            pattern = Pattern.compile(",");
            splitter = Splitter.on(',').trimResults();
        }
    }

    @Benchmark
    @Fork(1)
    public boolean doWithStringTokenizer(ThreadState ts) {
        while (ts.st.hasMoreTokens()) {
            String t = ts.st.nextToken().trim();
            if (t.equals(tokenToSearchFor)) {
                return true;
            }
        }
        return false;
    }

    @Benchmark
    @Fork(1)
    public boolean doWithRegex(ThreadState ts) {
        return ts.pattern.splitAsStream(ts.longString)
                .map(String::trim)
                .anyMatch(tokenToSearchFor::equals);
    }

    @Benchmark
    @Fork(1)
    public boolean doWithGuava(ThreadState ts) {
        Iterable<String> iterable = ts.splitter.split(ts.longString);
        for (String s : iterable) {
            if (s.equals(tokenToSearchFor)) {
                return true;
            }
        }
        return false;
    }

}

And the results:

Benchmark                        (tokenToSearchFor)  Mode  Cnt       Score        Error  Units
GuavaTest.doWithGuava                           300  avgt    5   19284.192 ±  23536.321  ns/op
GuavaTest.doWithGuava                          1000  avgt    5   67182.531 ±  93242.266  ns/op
GuavaTest.doWithRegex                           300  avgt    5   65780.954 ± 169044.641  ns/op
GuavaTest.doWithRegex                          1000  avgt    5  182530.069 ± 409571.222  ns/op
GuavaTest.doWithStringTokenizer                 300  avgt    5   34111.030 ±  61014.332  ns/op
GuavaTest.doWithStringTokenizer                1000  avgt    5  118963.048 ± 165510.183  ns/op

That makes guava the fastest indeed.

If you add parallel to the splitAsStream then it will become interesting, a must read here