Search code examples
scalalogstashbulk

Bulk insertion of data to elasticsearch via logstash with scala


I need to insert large bulk data to elasticsearch regulary via scala code. When googling, I found to use logstash for large insertion rate but logstash doesn't have any java libraries or Api to call so I tried to connect to it via http client. I don't know it is a good approach to send large data with http protocol or better to use other approaches for example using broker, queues, redis, etc.

I know last versions of logstash(6.X,7.x) enables uses of persistent queue so it can be another solution to use logstash queue but again through http or tcp protocol.

Also note that reliability is the first priority for me since data must not be lost and there should be a mechanism to return response in code so to handle success or failure.

I would appreciate any ideas.

Update

It seems using http is robust and has acknowledgement mechanism based on here but if taking this approach, what http client libs in scala is more appropriate as I need to send bulk data in sequence of key value format and handle response in none-blocking way?


Solution

  • It may sound overkill but introducing a buffering layer between scala code and logstash may prove helpful as you can get rid of heavy HTTP calls and rely on lightweight protocol transport.

    Consider adding Kafka between your scala code and logstash for queuing of messages. Logstash can reliably process the messages from Kafka using TCP transport and bulk insert into ElasticSearch. On the other hand, you can put messages into Kafka from your scala code in build (batches) to make the whole pipeline work efficiently.

    With that being said, if you don't have a volume in let say 10,000 msgs/sec then you can also consider fiddling around with logstash HTTP input plugin by tweaking threads and using multiple logstash processes. This is to reduce the complexity of adding another moving piece (Kafka) into your architecture.