elasticsearch, elasticsearch-bulk-api, elasticsearch-bulk

Elasticsearch bulk indexing and redundant data in the action part


When indexing data using the Elasticsearch bulk API, here is the sample JSON from the official documentation:

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "field1" : "value2" }
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }

While "preparing" the data to be used by the bulk API, on first line I have to specify the operation and in next line I will provide data. Some redundant parts on each line might look obvious and pretty harmless but when I am indexing trillions of rows, doesn't it add up to latency? Is there is better way to push all the rows by specifying the index name and type only once at the header? Specially when I can use autogenerated id, I can avoid generating terabytes of data just to be prepended to every row for the same purpose again and again.

I believe I am missing something obvious here; otherwise, I am sure the people at Elastic are smart enough to have figured this out already, and if they have done it this way, there must be a reason. But what?


Solution

  • Here is a shortcut:

    POST /test/_doc/_bulk
    { "index": {} }
    { "field1" : "value1" }
    { "index": {} }
    { "field1" : "value2" }
    { "index": {} }
    { "field1" : "value3" }
    

    Unfortunately, you still need to repeat the { "index": {} } line, but the index name and document type are now specified only once, in the request path (a client-library sketch follows after this answer).

    Please see more options in the Cheaper in Bulk article.
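
In practice, most client libraries build those metadata lines for you, so you never have to materialize them by hand. Below is a minimal sketch, assuming the Python client (elasticsearch-py), a cluster reachable at http://localhost:9200, and an illustrative index name and document generator; none of these come from the original post:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def generate_docs():
    # Plain source documents; no _id is set, so Elasticsearch autogenerates one,
    # and the helper emits the { "index": {} } action line for each document.
    for i in range(3):
        yield {"field1": f"value{i + 1}"}

# The index name is given once here; helpers.bulk chunks the stream and builds
# the newline-delimited bulk body under the hood.
helpers.bulk(es, generate_docs(), index="test")

The action lines still exist on the wire, since that is how the bulk endpoint works, but the client generates them per chunk as it streams, so you never produce or store terabytes of repeated metadata yourself.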