javascript, node.js, elasticsearch, web-scraping, ebay-api

How to avoid inserting a duplicate document to ElasticSearch


I'm scraping a large set of items using node.js/request and mapping the fields to ElasticSearch documents. The original documents have an ID field which never changes:

{ id: 123456 }

Periodically, I'd like to "refresh" and see which original items are no longer available, for whatever reason. Currently, I have a script which scrapes directly and simply inserts into Elastic.

Is there a way to check if an item with the same ID already exists before doing an insert? I don't want to end up with a ton of duplicates.


Solution

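  • Are you using your ID as the document _id? If so, this is easy: use the create operation type, which indexes a document only if no document with that _id exists yet and never overwrites an existing one. A request for an _id that is already taken is rejected with a 409 Conflict instead of producing a duplicate: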

    PUT your-index/your-type/123456/_create
    {
        "foo" : "bar"
    }
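
From the node.js scraper, the same request could be sent roughly as in the sketch below. This is only an illustration under some assumptions: the helper names (buildCreateRequest, interpretStatus, createIfAbsent) are made up for this example, the cluster address is a placeholder, and fetchImpl is any fetch-compatible function (for instance the global fetch in Node 18+). It relies on Elasticsearch answering 201 when the document is created and 409 when the _id is already taken. On newer Elasticsearch versions without mapping types, the path would likely be your-index/_create/123456 instead.

```javascript
// Hypothetical helper: build the create-only request for one scraped item.
// The item's stable id is used as the document _id, as in the answer above.
function buildCreateRequest(baseUrl, index, type, item) {
  const id = encodeURIComponent(String(item.id));
  return {
    url: `${baseUrl}/${index}/${type}/${id}/_create`,
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(item),
  };
}

// Hypothetical helper: translate the HTTP status into an outcome.
// 201 = document created, 409 = a document with this _id already exists.
function interpretStatus(status) {
  if (status === 201) return "created";
  if (status === 409) return "duplicate";
  throw new Error(`Unexpected status ${status}`);
}

// Send the request with any fetch-compatible function (assumption: Node 18+
// global fetch, or a polyfill passed in by the caller).
async function createIfAbsent(fetchImpl, baseUrl, index, type, item) {
  const { url, ...init } = buildCreateRequest(baseUrl, index, type, item);
  const res = await fetchImpl(url, init);
  return interpretStatus(res.status);
}

// Example usage (placeholder address, not executed here):
// createIfAbsent(fetch, "http://localhost:9200", "your-index", "your-type",
//                { id: 123456, foo: "bar" })
//   .then((outcome) => console.log(outcome));
```

Treating a 409 as "duplicate" rather than as an error lets the scraper keep going when it meets an item it has already stored.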