Search code examples
mysqldatabaseamazon-web-servicessolramazon-cloudsearch

How often should I upload documents to CloudSearch (Solr)?


Here is my use case:

I use MySQL as my primary data store and CloudSearch for searching. The database contains tables: threads, comments, upvotes, users.

I created an expression to sort search results based on "trending" using upvotes and created_at date (Hacker News Hot algorithm). This expression is called "trend", and used in a CloudSearch query like this: /search?q=Superman&sort=trend+desc

(upotes-1)/pow(floor((_time-created_at)/3600000)+2, 1.8)

Right now, when a user upvotes a thread or comment, it is stored in MySQL database. My question how should I keep the upvotes in sync with CloudSearch?

The two options I see:

  1. Immediately insert (replace) an upvote in MySQL, then update the score on CloudSearch. This involves sending a single document upload on every upvote, but ensures real-time accuracy.
  2. Immediately insert (replace) an upvote in MySQL, then keep the upvote in cache somewhere (Redis?). Once every hour, upload all the upvotes to CloudSearch.

What is the best way to handle this situation?


Solution

  • It really depends on a lot of things

    1. Your solr setup, how many servers, how much memory, CPU, storage, how many documents, what is the index size per shard/server etc.

    2. How many "estimated" upvotes are you expecting? If you take option 1, it will be easier to decide if you can some how estimate this number.

      Since you are using SolrCloud, it has NRT feature which makes sure that documents are almost immediately available for searches. But again it depends on your current document corpus, and how many updates per second or minute you are expecting.

    If you know the number of upvotes (updates to solr) and if you have sufficiently good servers, I would go with option 1, since it will reduce the overhead of manitaining another database, and logic for updating upvotes every hour into solr.

    You can always setup a couple of test servers, and do some stress testing to find out the exact number of updates under which Solr's performance will reduce.

    I know this probably does not give you an exact yes or no, but like I said, it really depends on your particular use case.