Tags: java, jdbc, logstash, ojdbc

Logstash Java OutOfMemory only with use_prepared_statements => true in jdbc input when synchronizing a huge initial dataset


We are getting a Java OutOfMemory error during the initial load of a huge dataset of ~70 million records into Elasticsearch with Logstash and the jdbc input. The Java heap size is 4 GB, but even with 8 or 16 GB the situation is the same.

The very same dataset can be loaded successfully without use_prepared_statements (using literals).

We have successfully loaded the whole view once with use_prepared_statements => false, but to make delta loads more efficient as well, we would like to use prepared statements. Our idea is to use jdbc_fetch_size => 1000 to limit the amount of data per fetch. According to the documentation, jdbc_page_size and jdbc_paging_enabled have no effect when prepared statements are used. Paging would not be efficient for the initial load anyway: the view behind the query is complex, and it would be re-executed with a different offset for every page.
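
For illustration, a minimal sketch of the jdbc input options involved (a fragment of the input section, option names as in the Logstash jdbc input plugin); the paging options are commented out because they have no effect in prepared-statement mode:

jdbc {
    # With a single streaming query, the driver fetches 1000 rows per
    # round trip instead of buffering the whole result set.
    jdbc_fetch_size => 1000
    use_prepared_statements => true
    # Not in effect when use_prepared_statements => true:
    # jdbc_paging_enabled => true
    # jdbc_page_size => 100000
}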

Prepared statements as such work fine against Oracle: we saw that fetches for roughly 1 million of the ~70 million records had completed when Logstash failed with the OutOfMemory error.

Our current Logstash config:

input {
    jdbc {
        type => "sim_list"
        jdbc_validate_connection => true
        jdbc_driver_library => "/pkg/moip/otc/apps/ls-otc-tmsp_tua2/conf/ojdbc6.jar"
        jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
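        # the Java:: prefix works around JRuby class-loading issues with the Oracle driver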
        jdbc_connection_string => "jdbc:oracle:thin:@db-server:1521:TA2A"
        jdbc_user => "test"
        jdbc_password => "testpwd"
        tracking_column => "last_import"
        use_column_value => true
        tracking_column_type => "timestamp"
        prepared_statement_bind_values => [":sql_last_value"]
        prepared_statement_name => "get_sim_list_tua2_v5"
        use_prepared_statements => true
        last_run_metadata_path => "/pkg/moip/otc/data/ls-otc-tmsp_tua2/jdbc/sql_last_run_sim_list_tua2_v5"
        statement => "SELECT simlist.*, simlist.LAST_UPDATE as last_import
FROM (SELECT *
      FROM V_SIM_LIST_VIEW
        WHERE last_update between (?) - 1/86400 and sysdate
          ORDER BY last_update ASC) simlist"
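        # the single "?" is bound to :sql_last_value (see prepared_statement_bind_values);
        # 1/86400 of a day is one second in Oracle date arithmetic, i.e. a
        # one-second overlap window to avoid missing rows at the boundary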
        # run every 5 seconds
        schedule => "*/5 * * * * *"
        connection_retry_attempts => "3"
        jdbc_fetch_size => 1000
    }
}

filter {
    mutate {
        copy => {
            # copy takes plain field names, not %{...} sprintf references
            "di_id" => "[@metadata][_id]"
        }
        split => { "user_ids" => "," }
        split => { "option_ids" => "|" }
        split => { "apn_ids" => "|" }
        split => { "ip_address" => "|" }
    }
}

output {
    if [type] == "sim_list" {
        elasticsearch {
            hosts => ["https://ece.mydomain.com:443"]
            index => "sim_list_tua2_v5"
            document_id => "%{di_id}"
            #doc_as_upsert => true
            manage_template => false
            cacert => "/pkg/moip/otc/apps/ls-otc-tmsp_tua2/conf/certs/ca.crt"
            ssl => true
            ssl_certificate_verification => true
            user => "logstash_write"
            password => "${LSWRITEPWD}"
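            # ${LSWRITEPWD} is resolved at runtime from the Logstash keystore or an environment variable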
        }
    }
}

The idea is to use one big query for the initial load, streamed in chunks of jdbc_fetch_size rows. Subsequent scheduled runs then use :sql_last_value to pick up only incremental changes. However, it breaks shortly after starting, after roughly 1000 fetches.

It seems that jdbc_fetch_size does not work as expected in this mode, and we have no clue why it keeps failing.

We tested with ojdbc6.jar and ojdbc10.jar against Oracle 19c, with no difference. Logstash is running on Java 11.0.10.


Solution

  • We found the following workaround, which works nicely:

    Run Logstash with two different configs (which we need anyway):

    1. Initial sync: literals, i.e. use_prepared_statements => false
    2. Delta sync: use_prepared_statements => true

    In addition, since we have to maintain two Logstash configs anyway, we added optimizer hints to the SQL of the initial sync and left them out of the SQL of the delta sync, so each scenario gets an optimal setup; see the sketch below.
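
    A minimal sketch of how the two jdbc input blocks might differ (assumptions: the PARALLEL hint is only an example, the real hints depend on the view; connection, tracking, and schedule options are the same as in the config above):

        # Initial-Sync config: :sql_last_value is interpolated as a literal,
        # SQL carries optimizer hints for the one-off full load
        jdbc {
            use_prepared_statements => false
            jdbc_fetch_size => 1000
            statement => "SELECT /*+ PARALLEL */ simlist.*, simlist.LAST_UPDATE as last_import
        FROM (SELECT * FROM V_SIM_LIST_VIEW
              WHERE last_update between :sql_last_value - 1/86400 and sysdate) simlist"
        }

        # Delta-Sync config: prepared statement, SQL without hints
        jdbc {
            use_prepared_statements => true
            prepared_statement_name => "get_sim_list_tua2_v5"
            prepared_statement_bind_values => [":sql_last_value"]
            jdbc_fetch_size => 1000
            statement => "SELECT simlist.*, simlist.LAST_UPDATE as last_import
        FROM (SELECT * FROM V_SIM_LIST_VIEW
              WHERE last_update between (?) - 1/86400 and sysdate) simlist"
        }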