The documentation of LocalNodeFirstLoadBalancingPolicy states:
Selects local node first and then nodes in local DC in random order. Never selects nodes from other DCs. For writes, if a statement has a routing key set, this LBP is token aware - it prefers the nodes which are replicas of the computed token to the other nodes.
However, in my Spark job logs I can see all the nodes in the cluster being added, including those from the other DC:
21/05/05 10:08:40 INFO CassandraWriter$: Setting local_dc: DC1
21/05/05 10:08:40 INFO CassandraWriter$: Writing to DC: DC1, available host ips: x.x.x.54,x.x.x.237,x.x.x.168,x.x.x.197,x.x.x.219
21/05/05 10:08:41 INFO Cluster: New Cassandra host /x.x.x.219:9042 added
21/05/05 10:08:41 INFO Cluster: New Cassandra host /x.x.x.237:9042 added
21/05/05 10:08:41 INFO Cluster: New Cassandra host /x.x.x.54:9042 added
21/05/05 10:08:41 INFO Cluster: New Cassandra host /x.x.x.238:9042 added
21/05/05 10:08:41 INFO LocalNodeFirstLoadBalancingPolicy: Added host x.x.x.238 (DC2)
21/05/05 10:08:41 INFO Cluster: New Cassandra host /x.x.x.168:9042 added
21/05/05 10:08:41 INFO Cluster: New Cassandra host /x.x.x.42:9042 added
21/05/05 10:08:41 INFO LocalNodeFirstLoadBalancingPolicy: Added host x.x.x.42 (DC2)
21/05/05 10:08:41 INFO Cluster: New Cassandra host /x.x.x.109:9042 added
21/05/05 10:08:41 INFO LocalNodeFirstLoadBalancingPolicy: Added host x.x.x.109 (DC2)
Could somebody help me understand why DC2 nodes are being added? As per my understanding, coordinator nodes are always chosen from local_dc.
I have tried to run ingestion without setting spark.cassandra.connection.local_dc as well and have seen the same logs.
See write code below:
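// Imports this snippet relies on (Spark and the Spark Cassandra Connector)
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector.cql.CassandraConnectorConf

records.write.cassandraFormat(table, keySpace)
  .mode(SaveMode.Append)
  .option(CassandraConnectorConf.LocalDCParam.name, cassandraDC.name)
  .option(CassandraConnectorConf.ConnectionHostParam.name, cassandraDC.availableHosts.mkString(","))
  .save()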
PS: I have separate Spark and Cassandra clusters, and my use case is to write data from the Spark cluster to Cassandra.
You can ignore these messages. This is how the Cassandra drivers work: on initialization the driver discovers the full topology of the cluster, and only afterwards does the load balancing policy decide to use just the nodes from the given datacenter.
For example, messages like "New Cassandra host /x.x.x.54:9042 added" come from the Java driver itself, which reports every node it discovers. Messages like "Added host x.x.x.238 (DC2)" come from LocalNodeFirstLoadBalancingPolicy, which has to implement the host-added callback defined in the driver's load balancing policy interface, so it is notified about every node as well. The policy always keeps a map of all nodes, but it does not use the nodes that are outside the local data center when it picks coordinators.
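To make the distinction concrete, here is a minimal sketch of the idea in plain Scala. It is not the connector's actual implementation: Host, LocalDcOnlyPolicy, onAdd, knownHosts and queryPlan are hypothetical names made up for illustration, and the real policy additionally prefers the local node and is token aware for writes. It only shows how a policy can record every discovered node and still hand out coordinators from the local DC only.

```scala
import scala.collection.mutable
import scala.util.Random

// Hypothetical, simplified stand-in for a driver host (not the driver's class).
final case class Host(address: String, dc: String)

// Hypothetical, simplified policy: tracks all hosts, queries only the local DC.
final class LocalDcOnlyPolicy(localDc: String) {
  // Every host the driver reports ends up here, regardless of its DC.
  private val allHosts = mutable.LinkedHashMap.empty[String, Host]

  // Called for each discovered host; this is where an
  // "Added host ... (DC2)" style message would be logged.
  def onAdd(host: Host): Unit =
    allHosts.update(host.address, host)

  // The full map of known hosts, including remote DCs.
  def knownHosts: Seq[Host] = allHosts.values.toSeq

  // Candidate coordinators: local DC hosts only, in random order.
  def queryPlan(): Seq[Host] =
    Random.shuffle(allHosts.values.filter(_.dc == localDc).toSeq)
}

val policy = new LocalDcOnlyPolicy("DC1")
policy.onAdd(Host("x.x.x.54", "DC1"))
policy.onAdd(Host("x.x.x.237", "DC1"))
policy.onAdd(Host("x.x.x.238", "DC2")) // recorded, but never queried

val plan = policy.queryPlan()
```

Under these assumptions, knownHosts contains all three hosts while plan contains only the two DC1 hosts, which mirrors what the logs in the question show: DC2 nodes are registered, not used as coordinators.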