Tags: csv, elasticsearch, bigdata, nest, bulkinsert

How to index a 1 billion row CSV file with Elasticsearch?


Imagine you had a large CSV file - let's say 1 billion rows.

You want each row in the file to become a document in Elasticsearch.

You can't load the file into memory - it's too large, so it has to be streamed or chunked.

The time taken is not a problem. The priority is making sure ALL data gets indexed, with no missing data.

What do you think of this approach:

Part 1: Prepare the data

  • Loop over the CSV file in batches of 1k rows
  • For each batch, transform the rows into JSON and save them into a smaller file
  • You now have 1m files, each with 1000 lines of nice JSON
  • The filenames should be incrementing IDs, for example running from 1.json to 1000000.json (a sketch of this step follows the list)
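
Here is a minimal sketch of Part 1 in Python, assuming the CSV has a header row so each row can become a JSON object; the input path, output directory, and batch size are placeholder assumptions:

```python
import csv
import json
import os

BATCH_SIZE = 1000          # rows per output file
INPUT_CSV = "data.csv"     # hypothetical input path
OUTPUT_DIR = "batches"     # hypothetical output directory

os.makedirs(OUTPUT_DIR, exist_ok=True)

def write_batch(batch, batch_id):
    """Write one batch of rows as newline-delimited JSON to <batch_id>.json."""
    path = os.path.join(OUTPUT_DIR, f"{batch_id}.json")
    with open(path, "w", encoding="utf-8") as out:
        for row in batch:
            out.write(json.dumps(row) + "\n")

with open(INPUT_CSV, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)   # streams the file row by row, never loads it all
    batch, batch_id = [], 1
    for row in reader:
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            write_batch(batch, batch_id)
            batch, batch_id = [], batch_id + 1
    if batch:                    # final partial batch
        write_batch(batch, batch_id)
```

Writing newline-delimited JSON (one document per line) keeps each batch file easy to stream back later and maps naturally onto the bulk API.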

Part 2: Upload the data

  • Start looping over each JSON file and reading it into memory
  • Use the bulk API to upload 1k documents at a time
  • Record the success/failure of the upload in a result array
  • Loop over the result array and retry any upload that failed (see the sketch after this list)
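
A sketch of Part 2 using the Python Elasticsearch client (elasticsearch-py) for illustration; the host, index name, and ID scheme are assumptions:

```python
import json
import os
from elasticsearch import Elasticsearch, helpers

ES_HOST = "http://localhost:9200"   # hypothetical cluster address
INDEX = "csv-data"                  # hypothetical index name
OUTPUT_DIR = "batches"
TOTAL_FILES = 1_000_000

es = Elasticsearch(ES_HOST)

def upload_file(batch_id):
    """Bulk-index one JSON file; return True only if every document succeeded."""
    path = os.path.join(OUTPUT_DIR, f"{batch_id}.json")
    with open(path, encoding="utf-8") as f:
        actions = [
            {"_index": INDEX, "_id": f"{batch_id}-{i}", "_source": json.loads(line)}
            for i, line in enumerate(f)
        ]
    # raise_on_error=False collects per-document failures instead of raising
    success, errors = helpers.bulk(es, actions, raise_on_error=False)
    return not errors

results = {}
for batch_id in range(1, TOTAL_FILES + 1):
    results[batch_id] = upload_file(batch_id)

# Retry any file that did not fully succeed
for batch_id, ok in results.items():
    if not ok:
        results[batch_id] = upload_file(batch_id)
```

Giving each document a deterministic _id (file number plus line number) makes retries idempotent: re-uploading a file that partially succeeded overwrites the documents that already made it in rather than duplicating them.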

Solution

  • The steps you've mentioned above look good. A couple of other things that will make sure ES does not get overloaded:

    • In my experience, you can also increase the bulk request size, say to somewhere in the range of 4k-7k documents (start with 7k and, if that causes problems, experiment with smaller batches; going below 4k probably won't be necessary).
    • Set refresh_interval to a large value (or disable it during the load). This keeps Elasticsearch from refreshing the index too frequently while you are indexing. IMO the default value will also do. Read more in the Elasticsearch documentation on refresh_interval (a settings sketch follows at the end of this answer).

    As the comment above suggests, it's better to start with a smaller batch of data. Of course, if you use constants instead of hardcoding the values, tuning them later becomes much easier.
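
As a rough sketch of those two suggestions, again using the Python client (exact keyword arguments can vary between client versions); disabling refresh entirely during the load is one common option, though as the answer notes the default may also be fine:

```python
from elasticsearch import Elasticsearch

ES_HOST = "http://localhost:9200"   # hypothetical cluster address
INDEX = "csv-data"                  # hypothetical index name
BULK_SIZE = 5000                    # tunable constant, per the 4k-7k suggestion

es = Elasticsearch(ES_HOST)

# Pause automatic refreshes while the bulk load runs
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "-1"}})

# ... run the Part 2 upload loop here, sending BULK_SIZE documents per request ...

# Restore the default refresh interval and make everything searchable
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "1s"}})
es.indices.refresh(index=INDEX)
```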