We have billions of records indexed in an ES cluster. Each document contains fields like account ID, transaction ID, user name and so on (a few free-text string fields).
My application queries ES based on user search params (e.g. return transactions for user 'A' between dates X and Y, plus some other filters), and I want to store/export the response data to a CSV/Excel file.
For my use case, the number of documents returned from ES might be in the hundreds of thousands, or even millions. My question is: what are the various ways to export a "large" amount of data from ES?
These are "real-time" requests, not batch processing (i.e. the requesting user is waiting for the exported file to be created).
I read about pagination (size/from) and the scroll approach, but I'm not sure these are the best ways to export a large dataset from ES. (If I read it correctly, the size/from approach is capped at 10K results by default, and the scroll option is not recommended for real-time use cases.)
I'd like to hear from experts.
If your users need to export a large quantity of data, you need to educate them not to expect that export to happen in real time (for the sake of your other users and the well-being of your systems).
That's definitely a batch-processing job. The user triggers the export via your UI, and some process then wakes up and performs it asynchronously. When it's done, you notify the user that the export is available for download at some location, or you send the file via email.
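That flow can be sketched in a few lines. Below is a minimal, self-contained illustration of the pattern: the UI handler enqueues a job and returns immediately, a background worker drains the queue, builds the CSV, and marks the job done. All names here (`request_export`, `fetch_all_hits`, the `jobs` dict) are hypothetical; in a real system you would use a proper task queue (Celery, SQS, etc.), persist job status in a database, and have `fetch_all_hits` run the actual scan/scroll query against ES.

```python
import csv
import io
import queue
import threading

# Hypothetical in-memory job store; in production this would be a
# database table, and the queue a real task broker.
jobs = {}
job_queue = queue.Queue()

def fetch_all_hits(params):
    # Stand-in for the scan/scroll query against ES; yields fake documents.
    for i in range(3):
        yield {"account_id": "A", "transaction_id": i, "user_name": "user-a"}

def request_export(job_id, params):
    """Called from the UI handler: enqueue the job and return immediately."""
    jobs[job_id] = {"status": "pending", "file": None}
    job_queue.put((job_id, params))

def worker():
    """Background worker: performs the long export asynchronously."""
    while True:
        item = job_queue.get()
        if item is None:            # shutdown sentinel
            break
        job_id, params = item
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["account_id", "transaction_id", "user_name"])
        for hit in fetch_all_hits(params):
            writer.writerow([hit["account_id"], hit["transaction_id"], hit["user_name"]])
        # Here you would upload the file to shared storage and notify the
        # user (download link or email) instead of keeping it in memory.
        jobs[job_id] = {"status": "done", "file": buf.getvalue()}

t = threading.Thread(target=worker, daemon=True)
t.start()
request_export("job-1", {"user": "A"})
job_queue.put(None)                 # stop the worker once the job is drained
t.join()
```

The key point is only that the HTTP request returns right away while the heavy lifting happens elsewhere; the exact plumbing is up to your stack.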
To name an example: when you want to export your data from Twitter, you trigger a request and are notified later (even if your account contains just a few tweets) that your data is ready for download.
If you decide to proceed that way, then nothing prevents you from using the scan/scroll approach anymore.
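For reference, scan/scroll is essentially a cursor loop: run the first search, then keep asking for the next batch until one comes back empty. Here is a sketch of that loop in pure Python, with an in-memory list standing in for the cluster (`fetch_batch` and `fake_fetch` are hypothetical names). With the official Python client, `elasticsearch.helpers.scan` wraps exactly this loop for you, passing the `scroll_id` from each response into the next request; note also that recent ES versions steer deep pagination toward `search_after` (optionally with a point-in-time) instead of scroll.

```python
def scroll_all(fetch_batch, batch_size=1000):
    """Drain every hit batch by batch, the way a scan/scroll does.

    fetch_batch(cursor, size) stands in for the scroll call; against a
    real cluster each request would carry the scroll_id (or search_after
    sort values) returned by the previous response.
    """
    cursor = 0
    while True:
        batch = fetch_batch(cursor, batch_size)
        if not batch:               # empty batch => scroll context exhausted
            break
        yield from batch
        cursor += len(batch)

# In-memory stand-in for the index: 25,000 fake transaction documents,
# i.e. well past the default 10,000-result from/size window.
documents = [{"transaction_id": i} for i in range(25_000)]

def fake_fetch(cursor, size):
    return documents[cursor:cursor + size]

exported = list(scroll_all(fake_fetch))
```

Since the export now runs as a background job, holding a scroll context open for a few minutes is perfectly fine; the "not for real-time use" caveat applied to serving interactive search pages, not to batch drains like this.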