jboss7.x infinispan jgroups

Infinispan Initial State Transfer Hangs and times out


I'm trying to cluster a pair of servers with a shared Infinispan cache (replicated asynchronously). The first server always starts successfully and registers itself properly with the JDBC database. When the second starts, it also registers properly with the database, and I see a bunch of chatter between the two. Then, while waiting on a response from the second server, I get

`org.infinispan.commons.CacheException: Initial state transfer timed out`

I think it's just an issue of configuration, but I'm not sure how to debug my configuration issues. I've spent several days configuring and re-configuring my Infinispan XML, and my JGroups.xml:

Infinispan:

<?xml version="1.0" encoding="UTF-8"?>
<infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:infinispan:config:6.0"
xsi:schemaLocation="urn:infinispan:config:6.0     http://www.infinispan.org/schemas/infinispan-config-6.0.xsd
                   urn:infinispan:config:remote:6.0 http://www.infinispan.org/schemas/infinispan-cachestore-remote-config-6.0.xsd"
xmlns:remote="urn:infinispan:config:remote:6.0"
>

    <!-- *************************** -->
    <!-- System-wide global settings -->
    <!-- *************************** -->

    <global>
        <shutdown hookBehavior="DEFAULT"/>
        <transport clusterName="DSLObjectCache">
            <properties>
                <property name="configurationFile" value="jgroups.xml"/>
            </properties>
        </transport>
        <globalJmxStatistics enabled="false" cacheManagerName="Complex.com"/>
    </global>
    <namedCache name="ObjectCache">
        <transaction transactionMode="TRANSACTIONAL" />
        <locking
            useLockStriping="false"
        />
        <invocationBatching enabled="true"/>
        <clustering mode="replication">
            <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="100" replQueueMaxElements="100"/>
            <stateTransfer fetchInMemoryState="true" />
        </clustering>
        <eviction strategy="LIRS" maxEntries="500000"/>
        <expiration lifespan="86400000" wakeUpInterval="1000" />
    </namedCache>

    <default>
        <!-- Configure an asynchronous replication cache -->
        <locking
            useLockStriping="false"
        />
        <clustering mode="replication">
            <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="100" replQueueMaxElements="100"/>
            <stateTransfer fetchInMemoryState="true" />
        </clustering>
        <eviction strategy="LIRS" maxEntries="500000"/>
        <expiration lifespan="86400000" wakeUpInterval="1000" />
        <persistence>
            <cluster remoteCallTimeout="60000" />
        </persistence>
    </default>
</infinispan>

JGroups.xml:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
    <!-- Default the external_addr to #DEADBEEF so we can see errors coming through
        on the backend -->
    <TCP
        external_addr="${injected.external.address:222.173.190.239}"
        receive_on_all_interfaces="true"
        bind_addr="0.0.0.0"
        bind_port="${injected.bind.port:12345}"
        conn_expire_time="0"
        reaper_interval="0"
        sock_conn_timeout="20000"
        tcp_nodelay="true"

    />
    <JDBC_PING
        datasource_jndi_name="java:jboss/datasources/dsl/control"
    />
    <MERGE2 max_interval="30000" min_interval="10000"/>
    <FD_SOCK
        external_addr="${injected.external.address:222.173.190.239}"
        bind_addr="0.0.0.0"
    />
    <FD timeout="10000" max_tries="5"/>
    <VERIFY_SUSPECT timeout="1500"
        bind_addr="0.0.0.0"
    />
    <pbcast.NAKACK use_mcast_xmit="false"
              retransmit_timeouts="300,600,1200,2400,4800"
              discard_delivered_msgs="true"/>
    <UNICAST3 ack_batches_immediately="true"
    />
    <RSVP ack_on_delivery="true"
        throw_exception_on_timeout="true"
        timeout="1000"
    />
    <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                  max_bytes="400000"/>
    <pbcast.GMS print_local_addr="true" join_timeout="5000"
           view_bundling="true" view_ack_collection_timeout="5000"/>
    <FRAG2 frag_size="60000"/>
    <pbcast.STATE_SOCK
        bind_port="54321"
        external_addr="${injected.external.address:222.173.190.239}"
        bind_addr="0.0.0.0"
    />
    <pbcast.FLUSH timeout="1000"/>
</config>

I've tried, frankly, every configuration option I can think of, and I'm not sure why the replication keeps timing out. All communication between these servers is wide open. Sorry to just dump so much XML, but I'm not even sure how to collect more information.


Solution

  • After more digging, I found that Infinispan was writing its logs to server.log, but because of my logging configuration they were not echoed to the console. Those logs showed that a single element in my cached objects was not serializable, so the objects could not be written to the wire and transferred. The log messages are very specific, which made this an easy problem to track down once I realized where the logs were being written.

    If you come here from the future, my advice is to just tail every single log you can on the working server, and see what comes up.
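
For anyone who wants to check for this kind of problem before starting a second node, here is a minimal sketch of a serializability check. It uses plain JDK serialization, which is only an approximation of what Infinispan's marshaller does, and the `SerializationCheck` and `CachedItem` names are hypothetical stand-ins, not anything from my actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationCheck {

    /** Hypothetical cache value: Serializable itself, but one field is not. */
    static class CachedItem implements Serializable {
        private static final long serialVersionUID = 1L;
        String name = "example";
        Object helper = new Object(); // java.lang.Object does not implement Serializable
    }

    /**
     * Tries to write the whole object graph with plain JDK serialization.
     * Returns null on success, otherwise a description of the failure
     * (for NotSerializableException, the message names the offending class).
     */
    static String findProblem(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // A String serializes cleanly, so no problem is reported.
        System.out.println("String:     " + findProblem("a plain string"));
        // CachedItem fails because of its non-serializable 'helper' field.
        System.out.println("CachedItem: " + findProblem(new CachedItem()));
    }
}
```

Running something like this against each type you put in the cache should point at the field to fix, either by making it serializable or by marking it `transient` if it does not need to be replicated.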