I'm reading an article about how to design a Twitter Search. The basic idea is to map tweets based on their ids to servers where each server has the mapping
English word -> A set of tweetIds having this word
Now if we want to find all the tweets that have some word all we need is to query all servers and aggregate the results. The article casually suggests that we can also sort the results by some parameter like "popularity" but isn't that a heavy task, especially if the word is an hot word?
What is done in practice in such search systems?
Maybe some tradeoff are being used?
Thanks!
First of all, there are two types of indexes: local and global.
A local index is stored on the same computer as tweet data. For example, you may have 10 shards and each of these shards will have its own index; like word "car" -> sorted list of tweet ids.
When search is run we will have to send the query to every server. As we don't know where the most popular tweets are. That query will ask every server to return their top results. All of these results will be collected on the same box - the one executing the user request - and that process will pick top 10 of of entire population.
Since all results are already sorted in the index itself, it is a O(1) operation to pick top 10 results from all lists - as we will be doing simple heap/watermarking on set number of tweets.
Second nice property, we can do pagination - the next query will be also sent to every box with additional data - give me top 10, with popularity below X, where X is the popularity of last tweet returned to customer.
Global index is a different beast - it does not live on the same boxes as data (it could, but does not have to). In that case, when we search for a keyword, we know exactly where to look for. And the index itself is also sorted, hence it is fast to get top 10 most popular results (or get pagination).
Since the global index returns only tweet Ids and not tweet itself, we will have to lookup tweets for every id - this is called N+1 problem - 1 query to get a list of ids and then one query for every id. There are several ways to solve this - caching and data duplication are by far most common approaches.