Tags: java, autocomplete, linkedin-api, typeahead, cleo

Is Cleo (LinkedIn's autocomplete solution) suitable for billions of elements?


Cleo has several different types of typeahead searches, which are backed by some very clever indexing strategies. The GenericTypeahead is presumably intended for the largest datasets. From http://sna-projects.com/cleo/design.php: "The GenericTypeahead is designed for large data sets, which may contain millions of elements..." Unfortunately, the documentation doesn't go into how well, or how, the Typeaheads scale up. Has anyone used Cleo for very large datasets who might have some insight?


Solution

Cleo is for a single instance/node (i.e. a single JVM) and does not have any routing or broker logic. Within a single Cleo instance, you can have multiple logical partitions to take advantage of multi-core CPUs. On a typical commodity box with 32 GB to 64 GB of memory, you can easily support tens of millions of elements by setting up 2 or 3 Cleo GenericTypeahead instances.
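
As a rough illustration of what per-core logical partitions buy you, here is a minimal sketch. It is not Cleo's actual API; all class and method names below are made up, and a sorted set stands in for Cleo's real indexes. The point is just the shape: elements are hashed across in-JVM partitions, and one prefix query fans out to all partitions in parallel.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical stand-in for one node with N logical partitions.
public class LocalPartitionedTypeahead {
    private final List<NavigableSet<String>> partitions = new ArrayList<>();
    private final ExecutorService pool;

    public LocalPartitionedTypeahead(int numPartitions) {
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ConcurrentSkipListSet<>());
        }
        pool = Executors.newFixedThreadPool(numPartitions);
    }

    // Route each element to a partition by hashing, spreading data evenly.
    public void add(String element) {
        int p = Math.floorMod(element.hashCode(), partitions.size());
        partitions.get(p).add(element);
    }

    // Search every partition on its own thread/core and merge the matches.
    public List<String> search(String prefix) throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (NavigableSet<String> part : partitions) {
            futures.add(pool.submit(() -> prefixMatches(part, prefix)));
        }
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            merged.addAll(f.get());
        }
        Collections.sort(merged); // placeholder for real relevance ranking
        return merged;
    }

    private static List<String> prefixMatches(NavigableSet<String> set, String prefix) {
        // Strings sharing the prefix fall in [prefix, prefix + '\uffff').
        return new ArrayList<>(set.subSet(prefix, prefix + "\uffff"));
    }
}
```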

To support billions of elements, you will have to use horizontal partitioning to set up many Cleo instances on many commodity boxes and then do scatter-and-gather.
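Here is a hedged sketch of that scatter-and-gather step, assuming each Cleo box sits behind some RPC or HTTP endpoint. The TypeaheadClient interface is a hypothetical stand-in for whatever client you put in front of each box; it is not part of Cleo itself.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical client for one remote Cleo node (not part of Cleo).
interface TypeaheadClient {
    List<String> search(String prefix, int maxHits);
}

public class ScatterGatherTypeahead {
    private final List<TypeaheadClient> nodes; // one client per Cleo box
    private final ExecutorService pool;

    public ScatterGatherTypeahead(List<TypeaheadClient> nodes) {
        this.nodes = nodes;
        this.pool = Executors.newFixedThreadPool(nodes.size());
    }

    // Scatter: send the query to every node in parallel.
    // Gather: merge the per-node hits and keep the global top-k.
    public List<String> search(String prefix, int topK) throws InterruptedException {
        List<Callable<List<String>>> calls = new ArrayList<>();
        for (TypeaheadClient node : nodes) {
            calls.add(() -> node.search(prefix, topK));
        }
        // Example 100 ms budget; tasks still running at the deadline are cancelled.
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : pool.invokeAll(calls, 100, TimeUnit.MILLISECONDS)) {
            try {
                merged.addAll(f.get());
            } catch (CancellationException | ExecutionException e) {
                // A slow or failed node contributes nothing for this query.
            }
        }
        Collections.sort(merged); // placeholder for real relevance ranking
        return merged.size() > topK ? merged.subList(0, topK) : merged;
    }
}
```

The per-query timeout is the usual design choice here: typeahead favors latency over completeness, so a straggling partition is simply dropped from that query's results rather than stalling the whole response.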

Check out https://github.com/jingwei/cleo-primer to see how to set up a single Cleo GenericTypeahead instance within minutes.

Cheers.