python, elasticsearch, avro

Indexing an Avro file into Elasticsearch in bulk


I wrote this short, simple script:

from elasticsearch import Elasticsearch
from fastavro import reader

es = Elasticsearch(['someIP:somePort'])
with open('data.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        es.index(index="my_index", body=record)

It works fine. Each record comes back from fastavro as a dict that serializes to JSON, and Elasticsearch indexes JSON documents. But sending one request per record in a for loop is very slow. Is there a way to do this in bulk?


Solution

  • There are two ways to do this:

    1. Call the Elasticsearch Bulk API directly, for example with the Python requests library.
    2. Use the Elasticsearch Python library, whose helpers call the same Bulk API internally:
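        from elasticsearch import Elasticsearch
        from elasticsearch import helpers
        from fastavro import reader

        es = Elasticsearch(['someIP:somePort'])

        with open('data.avro', 'rb') as fo:
            avro_reader = reader(fo)
            # Generator expression: actions are produced lazily, so the
            # whole file is never held in memory at once.
            actions = (
                {
                    "_index": "my_index",
                    # Mapping types are deprecated in Elasticsearch 7.x and
                    # removed in 8.x; drop "_type" on those versions.
                    "_type": "record",
                    "_id": j,
                    "_source": record,
                }
                for j, record in enumerate(avro_reader)
            )
            # Must run inside the with block, while the file is still open.
            helpers.bulk(es, actions)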
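
For option 1, the Bulk API expects a newline-delimited JSON body: one action line followed by one document line per record, with a trailing newline at the end. The following is only a minimal sketch of that approach. The function names build_bulk_body and post_bulk are hypothetical, the base URL is whatever your cluster uses, and default=str is an assumed, crude fallback for Avro values that are not JSON-serializable (dates, decimals).

```python
import json


def build_bulk_body(records, index_name):
    """Serialize records into the newline-delimited JSON body used by _bulk.

    Hypothetical helper for illustration; not part of any library.
    """
    lines = []
    for doc_id, record in enumerate(records):
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself. default=str is an assumption,
        # a crude fallback for values json cannot encode natively.
        lines.append(json.dumps(record, default=str))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


def post_bulk(base_url, body):
    """Send one bulk request. Needs the third-party requests package."""
    import requests  # imported lazily so build_bulk_body works without it

    response = requests.post(
        f"{base_url}/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
    )
    response.raise_for_status()
    return response.json()
```

With the question's setup this would be called roughly as post_bulk("http://someIP:somePort", build_bulk_body(avro_reader, "my_index")), inside the with block. The Bulk API can return HTTP 200 even when individual items fail, so it is worth checking the "errors" field of the returned JSON. For large files the records would also need to be sent in several smaller requests, which is something helpers.bulk in option 2 already does.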