
Handling failing seed nodes in Astyanax Cassandra API


Maybe I misunderstood how the automatic node discovery in the Astyanax Cassandra API works, but here is my problem:

I have the following setup:

2 Datacenters with 2 nodes each and a replication factor of 2.

DC1: N1 and N2; DC2: N3 and N4

The seed nodes are N1 and N3 (these are also the hosts provided to the application). The automatic discovery of the other nodes (N2 and N4) seems to work, even though they are not shown in the host pool.

If N3 fails, the data is correctly written to N4 and it is also correctly synchronized to N3 when the node comes up again. The same goes for N1 and N2.

The problem happens when both seed nodes (N1 and N3) fail. I expected the data to be written to N2 and N4 in that case, but it is not; instead, an exception causes the application to fail. (When only one seed node is down, Astyanax logs an exception at info level, but this normally doesn't cause the application to fail.)

It is clear that the seed nodes have to be online when the application starts, but I thought that the automatic node discovery in Astyanax would allow the seed nodes to fail afterwards, so that the replica nodes can take over (using a consistency level of CL_ONE).

Is there a way to avoid this failure, or do I just misunderstand the automatic node discovery, or am I just doing something terribly wrong?

Some additional information: the nodes mostly use the default settings in cassandra.yaml, and the tokens were generated with the Python script proposed in the documentation.

private AstyanaxContext<Cluster> connect(final String hosts) {
    AstyanaxConfigurationImpl asConfig = new AstyanaxConfigurationImpl();
    asConfig.setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_ONE);
    asConfig.setDefaultReadConsistencyLevel(ConsistencyLevel.CL_ONE);
    AstyanaxContext<Cluster> context = new AstyanaxContext.Builder()
            .forCluster("TestSuitCluster")
            .withAstyanaxConfiguration(
                    asConfig.setDiscoveryType(NodeDiscoveryType.TOKEN_AWARE)
                    .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE))
            .withConnectionPoolConfiguration(
                    new ConnectionPoolConfigurationImpl(
                            "CassandraConnectionPool").setSeeds(hosts)
                            .setMaxConnsPerHost(8).setMaxConns(8))
            .withConnectionPoolMonitor(new ConnectionPoolMonitor())
            .buildCluster(ThriftFamilyFactory.getInstance());
    context.start();
    return context;
}

The stack trace shown when the last seed node goes down:

com.netflix.astyanax.connectionpool.exceptions.PoolTimeoutException: PoolTimeoutException: [host=127.0.0.1(127.0.0.1):9160, latency=2000(2000), attempts=1]Timed out waiting for connection
    at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.waitForConnection(SimpleHostConnectionPool.java:218)
    at com.netflix.astyanax.connectionpool.impl.SimpleHostConnectionPool.borrowConnection(SimpleHostConnectionPool.java:185)
    at com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.borrowConnection(RoundRobinExecuteWithFailover.java:66)
    at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:67)
    at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:256)
    at com.netflix.astyanax.thrift.ThriftClusterImpl.describeKeyspaces(ThriftClusterImpl.java:165)
    at com.netflix.astyanax.thrift.ThriftClusterImpl.describeKeyspace(ThriftClusterImpl.java:184)
    at at.dbeg.cassandra.CasandraTestSuit.deleteKeyspace(CasandraTestSuit.java:134)
    at at.dbeg.cassandra.CasandraTestSuit.runTests(CasandraTestSuit.java:189)
    at at.dbeg.cassandra.CasandraTestSuit.main(CasandraTestSuit.java:50)    
com.netflix.astyanax.connectionpool.exceptions.ConnectionAbortedException: ConnectionAbortedException: [host=127.0.0.1(127.0.0.1):9160, latency=0(0), attempts=1]org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset by peer: socket write error
    at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:193)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
    at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:151)
    at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:69)
    at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:256)
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:485)
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$000(ThriftKeyspaceImpl.java:79)
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$6$3.execute(ThriftKeyspaceImpl.java:355)
    at at.dbeg.cassandra.CasandraTestSuit.testWrite(CasandraTestSuit.java:269)
    at at.dbeg.cassandra.CasandraTestSuit.runTests(CasandraTestSuit.java:168)
    at at.dbeg.cassandra.CasandraTestSuit.main(CasandraTestSuit.java:50)
Caused by: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset by peer: socket write error
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
    at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
    at org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:65)
    at org.apache.cassandra.thrift.Cassandra$Client.send_insert(Cassandra.java:833)
    at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:822)
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$6$3$1.internalExecute(ThriftKeyspaceImpl.java:367)
    at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$6$3$1.internalExecute(ThriftKeyspaceImpl.java:358)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60)
    ... 10 more
Caused by: java.net.SocketException: Connection reset by peer: socket write error
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
    ... 17 more 

Solution

  • I think I finally found the answer. This is not possible in a Cluster context without a custom HostSupplier. The easiest way to solve the problem is to iterate over all keyspaces in the cluster and use the logic of RingDescribeHostSupplier to find all hosts.

    When this HostSupplier is set in the AstyanaxContext, the client shows the expected behavior.
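
    The idea of collecting hosts across all keyspaces can be sketched roughly as follows. This is only an illustration under assumptions, not the code actually used: UnionHostSupplier is a hypothetical name, and plain Strings with java.util.function.Supplier stand in for the Astyanax Host type and the Guava Supplier<List<Host>> that the context builder takes, so that the example has no dependencies. In a real setup, each delegate would presumably be a RingDescribeHostSupplier created for one keyspace of the cluster.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/**
 * Hypothetical stand-in: merges the hosts reported by several per-keyspace
 * suppliers. Strings replace Astyanax's Host type, and each delegate plays
 * the role of one per-keyspace ring-describe supplier (an assumption).
 */
class UnionHostSupplier implements Supplier<List<String>> {
    private final List<Supplier<List<String>>> delegates;
    // Last non-empty result, returned when no delegate yields any host.
    private volatile List<String> lastKnown = List.of();

    UnionHostSupplier(List<Supplier<List<String>>> delegates) {
        this.delegates = List.copyOf(delegates);
    }

    @Override
    public List<String> get() {
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> hosts = new LinkedHashSet<>();
        for (Supplier<List<String>> delegate : delegates) {
            try {
                hosts.addAll(delegate.get());
            } catch (RuntimeException e) {
                // This keyspace's ring could not be described (for example,
                // the contacted node is down); continue with the others.
            }
        }
        if (hosts.isEmpty()) {
            // Nothing discovered this round: keep the previous view of the ring.
            return lastKnown;
        }
        lastKnown = List.copyOf(hosts);
        return lastKnown;
    }
}
```

    Falling back to the last known host list is a design choice of this sketch: it keeps the pool populated with N2 and N4 even when a discovery round fails because the node being asked has just gone down.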