amazon-web-services, elasticsearch, deployment, blue-green-deployment

Blue/Green "deployment" of elasticsearch data?


I am planning on extracting (essentially scraping, with permission) some data from a web page and storing it in elasticsearch (you know, for search).

While I have permission to scrape the data from the site,

  • there is no API or other structured source for this data
  • it's manually authored straight into HTML
  • there are no unique identifiers that differentiate one entry from another (I will essentially be extracting around 1,000-5,000 entries from the DOM).

When I store this in es, I am planning to put it into a single index under one mapping type, say thing.

However, over time, the source (the HTML web page) is likely to change as they add/remove/change content of some of these entries. Since there are no identifiers in the source, I can't easily identify new ones (and even worse, deleted ones or changed ones).

I want to keep my es index up to date, and what I have in mind is some sort of blue-green mechanism:

  • I run the extraction process at some schedule (daily/weekly) depending on the velocity of the source changing
  • Every time it runs, the process produces a new index (or possibly a whole new cluster). Say the current index is index-prod and the new one built by the process is index-rc (release candidate)
  • It validates index-rc based on some heuristics (a flexible velocity check on the number of entries, sample queries that we know should work etc.)
  • And if it's valid, it either:
    • A. slowly flips queries into the new cluster/index
    • or B. flips in one shot to the new cluster/index
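
The validation step described above could look roughly like the following. This is only a sketch: the 20% tolerance, the function names, and the `search` callable (assumed to return the number of hits for a query against index-rc) are all hypothetical placeholders, not anything elasticsearch provides.

```python
from typing import Callable, Iterable


def count_within_tolerance(prod_count: int, rc_count: int, max_change: float = 0.2) -> bool:
    """Velocity check: is the candidate's entry count close enough to production's?"""
    if prod_count == 0:
        # First run, nothing to compare against: accept any non-empty candidate.
        return rc_count > 0
    return abs(rc_count - prod_count) / prod_count <= max_change


def sample_queries_pass(search: Callable[[str], int], queries: Iterable[str]) -> bool:
    """Every query that is known to work must return at least one hit."""
    return all(search(query) > 0 for query in queries)


def is_valid_candidate(
    prod_count: int,
    rc_count: int,
    search: Callable[[str], int],
    queries: Iterable[str],
    max_change: float = 0.2,
) -> bool:
    """Combine both heuristics; only a candidate passing both gets promoted."""
    return count_within_tolerance(prod_count, rc_count, max_change) and sample_queries_pass(
        search, queries
    )
```

In practice `search` would wrap a real query against index-rc, and the tolerance would be tuned to how fast the source page actually changes.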

I am planning on hosting the elasticsearch cluster using AWS Elasticsearch Service and could possibly concoct something using Route 53 CNAMEs (and maybe ELB?), but I wanted to know if there is more built-in support in elasticsearch itself for doing this?

Essentially, I want to swap one index's data for another.


Solution

  • You don't need to swap the entire data between indexes. If I understand correctly, you can use index aliases: have your queries go through an alias that points at the current index, then switch the alias to the next index version once it is ready.

    To shift queries over gradually instead, I suppose a load balancer such as nginx in front of the cluster is the best solution. There are many write-ups of this approach on the web.
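
A minimal sketch of the alias switch (option B in the question), using only the Python standard library against the `_aliases` endpoint. The alias name `things`, the index names `things-v1` and `things-v2`, and the base URL are made-up examples, and any authentication your cluster requires is left out. Both actions go in one request, so the alias moves from the old index to the new one in a single step.

```python
import json
import urllib.request


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the request body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(base_url: str, alias: str, old_index: str, new_index: str) -> int:
    """POST the swap to the cluster and return the HTTP status code."""
    body = json.dumps(build_alias_swap(alias, old_index, new_index)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_aliases",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical usage, once the new index has been validated:
# swap_alias("http://localhost:9200", "things", "things-v1", "things-v2")
```

Clients keep querying `things` the whole time, so nothing on the application side has to change when a new index version is promoted; the old index can be deleted afterwards or kept around for a rollback.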