elasticsearch, logstash, logstash-jdbc

Unbelievably slow indexing in ElasticSearch


We decided to include a search engine in our product and are comparing Elasticsearch and Solr. We started with Elastic 2.3.3 and immediately ran into the problem of slow indexing. We feed Elastic using Logstash, and indexing a table with 4,000,000 records took more than 8 hours; the physical size of the table is about 40 GB. We are using an HDD... yes, it's a pity. But on the same PC we tested Solr, and the same operation took 3 hours. Maybe we've made a mistake in the configuration of Elastic?

Another point: the Elasticsearch index ended up more than twice the size of the table, while the Solr index was only 8% of the database size. When we use Logstash to output the data to a file instead, it is very fast.

Here is our configuration of the Logstash jdbc input for Elastic:

input {
    jdbc {
        jdbc_driver_library => "F:\elasticsearch-2.3.3\lib\sqljdbc_4.0\enu\sqljdbc4.jar"
        jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
        jdbc_connection_string => "jdbc:sqlserver://s_tkachenko\mssql2014:49172;databaseName=work"
        jdbc_user => "sa"
        jdbc_password => "__masked_password__"
        statement => "SELECT id, name FROM Contact"
    }
}
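
For reference, the file-output comparison mentioned above used the same jdbc input with a plain file output, roughly like this sketch (the path is illustrative):

output {
    # write events straight to a local file to gauge raw Logstash/JDBC throughput
    file {
        path => "F:/logstash-out/contacts.json"
    }
}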

We set up only one shard and no replicas.
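
For completeness, a minimal sketch of how an index with those settings can be created in Elasticsearch 2.x (the index name and the curl call are illustrative, not our exact command):

curl -XPUT 'http://localhost:9200/contacts' -d '{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    }
}'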

Dear community, maybe you have some advice, because Elastic support will only help us after we buy a subscription, and buying a subscription for a product that isn't working well doesn't seem like a great idea. Thank you for your attention; we're waiting for your thoughts.


Solution

  • In the meantime you can make some changes in Logstash too:

    • Specify the worker count with -w {WORKER_COUNT}; CPU count * 2 was best in my experiments.
    • Specify the buffer size with -u {BUFFER_SIZE}; 512 worked best for me.

    You can also specify the output worker count and flush buffer size for the elasticsearch output plugin (a combined example invocation follows the config below):

    output {
        elasticsearch {
            # elasticsearch hosts
            hosts => ["127.0.0.1"]
            # bulk message size
            flush_size => 512
            # output workers: CPU cores * 2
            workers => 8
        }
    }
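
    Putting the flags and the output settings together, an invocation could look like the sketch below (the config file name is illustrative, and exact flag names can differ between Logstash versions, so check bin/logstash --help):

    # 8 pipeline workers (CPU cores * 2 on a quad-core machine), buffer size 512
    bin/logstash -f jdbc_to_es.conf -w 8 -u 512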
    

    Hope some of these help.