Caching vs Indexing


What's the real difference between a caching solution and an indexing solution? It seems to me that an indexing solution is in fact a cache with the ability to run search queries (Elasticsearch, for example). Would there ever be any real reason to use both a caching solution and an indexing solution within the same project, or does the indexing solution basically make any other caching redundant?

Example: Say I use NEST for Elasticsearch, which would store and return POCOs; if I then query Elasticsearch and have the POCO returned to me, isn't that considered to be using a cached object returned from Elasticsearch?

At the moment, I store data in a cache using an ICacheManager interface I have, something like this:

return CacheManager.Get(cacheKey, () =>
{
    // return something...
});

Would this become redundant with Elasticsearch?
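
For readers unfamiliar with the pattern, the `CacheManager.Get(cacheKey, factory)` call above looks like a get-or-load (cache-aside) lookup. Here is a minimal sketch of that idea in Python, assuming a plain in-memory dictionary; the `CacheManager` class, `load_product` and the key format are hypothetical stand-ins, not the asker's actual interface.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class CacheManager(Generic[T]):
    """Hypothetical in-memory stand-in for the ICacheManager mentioned above."""

    def __init__(self) -> None:
        self._items: Dict[str, T] = {}

    def get(self, key: str, load: Callable[[], T]) -> T:
        # Cache hit: return the stored value without calling the loader.
        if key in self._items:
            return self._items[key]
        # Cache miss: run the loader once, remember the result, return it.
        value = load()
        self._items[key] = value
        return value


calls = []


def load_product() -> dict:
    # Stand-in for a database query; records each call so misses are visible.
    calls.append("db")
    return {"id": 42, "name": "example"}


cache: CacheManager[dict] = CacheManager()
first = cache.get("product:42", load_product)   # miss: the loader runs
second = cache.get("product:42", load_product)  # hit: the loader is skipped
```

The point to notice is that the loader only runs on a miss, so the cache sits in front of whatever the loader talks to, whether that is a database or a search index.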

EDIT

Thanks to all of you for the answers. I am fully aware of what a cache is and already understood the general idea behind an index for textual searching, so I was only really wondering whether the index already doubles as a cache and would therefore make any other cache redundant. After all, I wouldn't want to keep two caches in memory (example: Elasticsearch + Redis) when one would do fine. I think I have a better idea now though, especially since I realized that not all fields are always stored in the index, so we need to get the object from a cache or directly from the DB anyway, at least in some cases. Thanks all!
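
The situation described in this edit, where the index holds only some fields and the full object still comes from a cache or the database, might look roughly like the sketch below. Everything here is an assumption for illustration: the dictionaries stand in for a real database, search index and cache, and `search_ids` and `hydrate` are hypothetical helper names.

```python
from typing import Dict, List

# Hypothetical stand-ins: a primary store with full records, and an index
# that keeps only the searchable title text keyed by record id.
DATABASE: Dict[int, dict] = {
    1: {"id": 1, "title": "Caching basics", "price": 10},
    2: {"id": 2, "title": "Indexing basics", "price": 12},
}
INDEX: Dict[int, str] = {doc_id: doc["title"].lower() for doc_id, doc in DATABASE.items()}
CACHE: Dict[int, dict] = {}


def search_ids(term: str) -> List[int]:
    """Return ids of documents whose indexed title contains the term."""
    return [doc_id for doc_id, text in INDEX.items() if term.lower() in text]


def hydrate(doc_id: int) -> dict:
    """Fetch the full record, trying the cache before the database."""
    if doc_id not in CACHE:
        CACHE[doc_id] = DATABASE[doc_id]
    return CACHE[doc_id]


# The index answers "which records match?"; the cache or DB supplies the rest.
results = [hydrate(doc_id) for doc_id in search_ids("indexing")]
```

In this arrangement the index and the cache do different jobs, which is one case where having both is not redundant.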


Solution

  • The whole purpose of a cache is to return already requested data as fast as possible. One constraint of caches is that they cannot grow too big either: they have to fit in memory, and an oversized cache gets slower and more expensive to maintain, which defeats the purpose of having a cache in the first place. That being said, it comes as no surprise that if you plan to have a few million or billion records in your DB, it won't be difficult to index them all, but it will be difficult to cache them all. Then again, since RAM keeps getting cheaper, you might be able to store all you need in memory. You also need to ask yourself whether your cache needs to be distributed across several hosts (now or in the future).
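
One common way to keep a cache from growing too big is to cap the number of entries and evict the least recently used one. The sketch below is only an illustration of that idea using Python's standard `collections.OrderedDict`; the `LruCache` class and its capacity of 2 are made up for the example and are not something this answer prescribes.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LruCache:
    """Keeps at most `capacity` entries, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._items:
            return None
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            # Drop the entry that has gone unused the longest.
            self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)


cache = LruCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used entry
cache.put("c", 3)   # exceeds capacity, so "b" is evicted
```

With millions of records and a small capacity, most lookups for rarely requested items will miss, which is the trade-off the paragraph above is pointing at.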

    Considering that lookups and queries in ES are extremely fast, i.e. usually faster than retrieving the same data from your DB (and ES brings you many more benefits on top of that, such as scoring), it would make sense to use ES as a cache. One issue I see is a common one: as soon as you start duplicating data (DB -> ES), you need to ensure that both stores don't get out of sync.
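
To make the synchronization concern concrete, here is a small sketch of a dual write plus a drift check. It assumes two in-memory dictionaries standing in for the DB and for ES; `save` and `find_drift` are hypothetical helpers, and a real setup would also have to deal with partial failures, retries and deletes.

```python
from typing import Dict, Set

# Hypothetical stand-ins for the primary database and the search index.
database: Dict[int, dict] = {}
search_index: Dict[int, dict] = {}


def save(doc_id: int, doc: dict) -> None:
    """Write to the primary store first, then mirror the change to the index."""
    database[doc_id] = dict(doc)
    search_index[doc_id] = dict(doc)


def find_drift() -> Set[int]:
    """Return ids that are missing from, or differ between, the two stores."""
    all_ids = set(database) | set(search_index)
    return {i for i in all_ids if database.get(i) != search_index.get(i)}


save(1, {"name": "first"})
save(2, {"name": "second"})
in_sync = find_drift()

# Simulate an update that reached the database but never reached the index.
database[2] = {"name": "second (edited)"}
drifted = find_drift()
```

Any code path that writes to one store but not the other produces exactly this kind of drift, so every write has to go through something like `save`, or be reconciled afterwards.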

    Now, if in addition you throw a cache into that mix, it's a third data store to maintain and to keep consistent with the main data store. If you know your data is pretty stable, i.e. written once and then rarely updated, then that might be OK, but you need to keep this concern in mind all the time when designing your data access strategy.

    As @paweloque said, in the end it all depends on your exact use case(s). Every problem is different and I can attest that after a few dozen projects around ES over the past five years or so, I've never seen two projects configured the same way. A cache might make sense for some specific cases, but not at all for others.

    You need to think hard about how and where you need to store your data, who is requesting it (and at what rate), and who is creating/updating it (and at what rate). In the end, the best practice is to keep your stack as lean as possible, with only as many components as needed, since each one is a potential bottleneck that you have to understand, integrate, maintain, tune and monitor.

    Finally, I'd add one more thing: adding a cache or an index should be considered a performance optimization of your software stack. You probably know the common saying "Premature optimization is the root of all evil", so you should first go with your database only, measure the performance, load test it, and see whether it actually fails to support the load. Only then can you decide to throw a cache and/or an index at it, depending on the needs. Again: load test, measure, then decide. If you only have ten users making a few requests per day, having only a DB might be perfectly fine. You have to understand when and why you need to add another layer to your Tower of Babel, but most importantly you need to add one layer at a time and see how that layer improves or degrades the stability of the stack.
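
As a rough idea of what "measure, then decide" can look like at the smallest scale, the sketch below times repeated calls to a data access function and summarizes the latencies. It is not a load test, and `query_database` is a hypothetical placeholder for whatever call you actually want to evaluate; the percentile uses a simple nearest-rank rule.

```python
import time
from typing import Callable, Dict, List


def measure(fn: Callable[[], object], runs: int,
            clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Call fn `runs` times and return each call's duration in seconds."""
    durations = []
    for _ in range(runs):
        start = clock()
        fn()
        durations.append(clock() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Report the upper median, a nearest-rank 95th percentile, and the max."""
    ordered = sorted(durations)
    n = len(ordered)
    p95_index = (95 * n + 99) // 100 - 1  # ceil(0.95 * n) - 1, in integer math
    return {"p50": ordered[n // 2], "p95": ordered[p95_index], "max": ordered[-1]}


def query_database() -> int:
    # Stand-in for the real data access call being evaluated.
    return sum(range(1000))


stats = summarize(measure(query_database, runs=200))
```

Running the same measurement before and after adding a cache or an index, one layer at a time, gives you numbers to compare instead of a guess.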

    Last but not least, you can find some online articles from people who have used ES as a cache (mainly as a key-value store or an object cache).