python-2.7 cassandra-2.0 datastax

Using Python Cassandra driver for multiple connections errors out


I am using the Python Cassandra driver offered by Datastax to connect to a single node Cassandra instance. My Python code spawns multiple processes (using the multiprocessing module), each of which opens a connection to this node, and shuts it down during exit.

Here's the behavior I observe: when the number of processes spawned is small (say ~30), my code runs flawlessly. With a higher number, I see errors like these (probably not surprising):

  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 755, in connect
    self.control_connection.connect()
  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1868, in connect
    self._set_new_connection(self._reconnect_internal())
  File "/usr/local/lib/python2.7/dist-packages/cassandra/cluster.py", line 1903, in _reconnect_internal
    raise NoHostAvailable("Unable to connect to any servers", errors)
NoHostAvailable: ('Unable to connect to any servers', {'127.0.0.1': error(99, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Cannot assign requested address")})

Apparently, the host is unable to accept new connections. This looks like something the driver or Cassandra should take care of, by queueing new connection requests and granting them as capacity frees up.

How do I impose this behavior?


Solution

  • "Cannot assign requested address" can indicate that you're running out of local ports. This is not up to the driver -- it is a system configuration issue. Here is a good article about the problem (it refers to MySQL, but the issue is the same). Note that connections in TIME_WAIT state occupy local ports, and can linger beyond individual program runs.

    The article discusses multiple solutions, including expanding the port range, listening on multiple IP addresses, or changing application connection behavior. I would look at application behavior first and recommend running fewer processes. Depending on what you're trying to overcome with multiprocessing, you'd probably be best served by keeping (process count) <= (machine cores); multiprocessing.Pool defaults to one worker per core.