python, elasticsearch, elasticsearch-py

How to limit the amount of data uploaded to Elasticsearch


How can I limit the amount of data uploaded to Elasticsearch? My old laptop cannot handle a huge dataset like the one I'm using.

I used the following code to 'limit' the data being uploaded:

from elasticsearch import helpers, Elasticsearch
import csv
import itertools

with open('my_data.csv', encoding="utf8") as f:
    reader = csv.DictReader(f)
    for row in itertools.islice(reader, 1000):  # limit the number of rows read
        helpers.bulk(es, reader, index='movie-plots', doc_type=None)

But this is apparently not working; when I check with 'POST movie-plots/_count', it still returns the document count for the entire dataset.

I am completely new to Elasticsearch, so sorry if this is a novice question. I am using the Python client (in a Jupyter notebook) to work with Elasticsearch and Kibana.
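
In case it matters, es in the snippet above is just an Elasticsearch client pointed at my local cluster, created roughly like this (the URL is a placeholder for my own setup):

from elasticsearch import Elasticsearch

# Placeholder address for a local, unsecured cluster; adjust to your deployment.
es = Elasticsearch("http://localhost:9200")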


Solution

  • You are calling islice on reader, but then you are passing all of reader to helpers.bulk anyway, so the very first bulk call consumes the entire file.

    Not in a place where I can test, but try removing the for loop and just passing the islice to helpers.bulk directly:

    with open('my_data.csv', encoding="utf8") as f:
        reader = csv.DictReader(f)
        helpers.bulk(es, itertools.islice(reader, 1000), index='movie-plots', doc_type=None)
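
    Also note that if the index already contains documents from an earlier full upload, 'POST movie-plots/_count' will keep reporting that old total. A rough, untested sketch of verifying the limit from Python (reusing the imports and es client from above, and assuming a client version that supports these calls) could look like:

    # Start from an empty index so the count reflects only this upload.
    if es.indices.exists(index='movie-plots'):
        es.indices.delete(index='movie-plots')

    with open('my_data.csv', encoding="utf8") as f:
        reader = csv.DictReader(f)
        helpers.bulk(es, itertools.islice(reader, 1000), index='movie-plots')

    es.indices.refresh(index='movie-plots')        # make the new documents visible to _count
    print(es.count(index='movie-plots')['count'])  # should now report at most 1000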