Tags: java, performance, hibernate, hazelcast, second-level-cache

Hazelcast + Hibernate query slowdown when using second level cache


We are currently using Hazelcast 5.3.6 as a second-level cache for Hibernate 6.3.1.

We are experiencing a significant slowdown in performance when a second node joins the cluster. Hazelcast is currently configured in embedded mode (hazelcast.jcache.provider.type=member), and in the Hibernate configuration we have hibernate.cache.region.factory_class=jcache, hibernate.cache.use_second_level_cache=true, and hibernate.cache.use_query_cache=true.
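
For reference, the same settings written out as properties (values exactly as described above):

    hibernate.cache.use_second_level_cache=true
    hibernate.cache.use_query_cache=true
    hibernate.cache.region.factory_class=jcache
    hazelcast.jcache.provider.type=member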

Hazelcast is configured in this way:

    <advanced-network enabled="true">
        <join>
            <auto-detection enabled="false"/>
            <multicast enabled="true">
                <multicast-group>224.2.2.3</multicast-group>
                <multicast-port>54327</multicast-port>
                <multicast-time-to-live>32</multicast-time-to-live>
                <multicast-timeout-seconds>5</multicast-timeout-seconds>
            </multicast>
        </join>

        <member-server-socket-endpoint-config>
            <port>5701</port>
            <socket-options>
                <keep-alive>true</keep-alive>
                <tcp-no-delay>true</tcp-no-delay>
                <buffer-direct>true</buffer-direct>
            </socket-options>
        </member-server-socket-endpoint-config>
        <client-server-socket-endpoint-config>
            <port>9090</port>
            <socket-options>
                <keep-alive>true</keep-alive>
                <tcp-no-delay>true</tcp-no-delay>
                <buffer-direct>true</buffer-direct>
            </socket-options>
        </client-server-socket-endpoint-config>
    </advanced-network>

    <cache name="*">
        <statistics-enabled>false</statistics-enabled>
        <management-enabled>false</management-enabled>
        <in-memory-format>BINARY</in-memory-format>
        <expiry-policy-factory>
            <timed-expiry-policy-factory expiry-policy-type="ETERNAL"/>
        </expiry-policy-factory>
        <eviction eviction-policy="LRU" max-size-policy="ENTRY_COUNT" size="10000"/>
    </cache>

When only one server is started, we see the following in the Hibernate statistics output for one particular query:

o.h.e.i.StatisticalLoggingSessionEventListener - Session Metrics {
    13365 nanoseconds spent acquiring 3 JDBC connections;
    9838 nanoseconds spent releasing 3 JDBC connections;
    80964 nanoseconds spent preparing 3 JDBC statements;
    1092652 nanoseconds spent executing 3 JDBC statements;
    0 nanoseconds spent executing 0 JDBC batches;
    311366 nanoseconds spent performing 2 L2C puts;
    6779427 nanoseconds spent performing 45 L2C hits;
    181972 nanoseconds spent performing 2 L2C misses;
    0 nanoseconds spent executing 0 flushes (flushing a total of 0 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

When the second server joins, the same query produces:

INFO  o.h.e.i.StatisticalLoggingSessionEventListener - Session Metrics {
    11922 nanoseconds spent acquiring 3 JDBC connections;
    7064 nanoseconds spent releasing 3 JDBC connections;
    69741 nanoseconds spent preparing 3 JDBC statements;
    1039659 nanoseconds spent executing 3 JDBC statements;
    0 nanoseconds spent executing 0 JDBC batches;
    1093001 nanoseconds spent performing 2 L2C puts;
    10632154 nanoseconds spent performing 45 L2C hits;
    126368 nanoseconds spent performing 2 L2C misses;
    0 nanoseconds spent executing 0 flushes (flushing a total of 0 entities and 0 collections);
    0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

The L2C puts and hits take two to three times longer. As this is not the only query, the time accumulates and then also becomes noticeable in the UI.

Any suggestions on how to speed up the L2C hits and puts?


Solution

  • Hazelcast is a distributed cache: once it grows beyond a single node, network latency starts to affect the time needed for gets and puts, and performance will always be lower than what a purely local (single-node) solution can achieve.

    There are configuration options that can improve the performance, but at the cost of less consistency/accuracy.

    When reading, a single-node cluster will always find the record locally (if it exists). In a two-node cluster, we expect roughly 50% of gets to be satisfied locally and 50% to require a network call, and as clusters grow an even larger share of gets becomes remote. The performance degradation you see is therefore a natural consequence of distributing data across the network. With an embedded Hazelcast configuration you can set the 'read-from-backups' flag to true, which allows a higher percentage of calls to be resolved locally - but some of those reads may return stale values if the primary copy of the data has been modified and the change has not yet been propagated to the backup (see the first sketch below).

    When writing, in a single-node configuration Hazelcast writes the value and returns to the caller. In a multi-node configuration, Hazelcast writes first to the primary node and then to the configured number of synchronous backup nodes before returning to the caller, so writes take longer because there is always a backup copy to write (assuming the default configuration). You can improve performance by switching from synchronous to asynchronous backups, or by turning off backups altogether, but each of these changes increases the chance of data loss in the event of a node outage (see the second sketch below). If all cached data is known to be safely stored in a system of record where it can be reloaded after a failure, then this may be an acceptable risk.
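
    As an illustration of the read-side flag, Hazelcast's map configuration calls it read-backup-data; this is only a sketch of where such a flag lives, and it assumes map-backed cache regions - the JCache <cache> element shown in the question may not expose an equivalent option:

        <map name="*">
            <!-- Allow gets to be served from a local backup replica instead of
                 always going to the partition owner; reads may return stale
                 values until the backup is synchronized. -->
            <read-backup-data>true</read-backup-data>
        </map>

    On the write side, the <cache> element accepts backup-count and async-backup-count. A minimal sketch of trading the default synchronous backup for an asynchronous one (verify the exact behavior against your Hazelcast version):

        <cache name="*">
            <!-- 0 synchronous backups: a put returns as soon as the primary
                 copy is written. -->
            <backup-count>0</backup-count>
            <!-- 1 asynchronous backup, replicated in the background; data can
                 be lost if the owning member fails before replication. -->
            <async-backup-count>1</async-backup-count>
        </cache>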