Should a TCP client be able to pause the server when the TCP server reads a non-blocking socket?

Overview

I have a simple question with code below. Hopefully I didn't make a mistake in the code.

I'm a network engineer, and I need to test certain Linux behavior of our business application's keepalives during network outages (I'm going to insert some iptables rules later to mess with the connection; first I want to make sure I got the client and server right).

As part of a network failure test I'm conducting, I wrote a non-blocking Python TCP client and server that are supposed to blindly send messages to each other in a loop. To understand what's happening I am using loop counters.

The server's loop should be relatively straightforward. I loop through every fd that select says is ready. I never even import sleep anywhere in my server's code. From this perspective, I don't expect the server's code to pause while it loops over the client's socket ~~, but for some reason the server code pauses intermittently~~ (more detail, below).

I initially didn't put a sleep in the client's loop. Without a sleep on the client side, the server and client seem to be as efficient as I want. However, when I put a time.sleep(1) statement after the client does an fd.send() to the server, the TCP server code ~~intermittently~~ pauses while the client is sleeping.

My questions:

~~Should I be able to write a single-threaded Python TCP server that doesn't pause when the client hits time.sleep() in the client's fd.send() loop? If so, what am I doing wrong?~~ <- ANSWERED
If I wrote this test code correctly and the server shouldn't pause, why is the TCP server ~~intermittently~~ pausing while it polls the client's connection for data?

Reproducing the scenario

I'm running this on two RHEL6 Linux machines. To reproduce the issue...

Open two different terminals.
Save the client and server scripts in different files
Change the shebang path to your local python (I'm using Python 2.7.15)
Change the SERVER_HOSTNAME and SERVER_DOMAIN in the client's code to be the hostname and domain of the server you're running this on
Start the server first, then start the client.

After the client connects, you'll see messages as shown in EXHIBIT 1 scrolling quickly in the server's terminal. ~~After a few seconds~~ The scrolling pauses ~~intermittently~~ when the client hits time.sleep(). I don't expect to see those pauses, but maybe I've misunderstood something.

EXHIBIT 1

---
LOOP_COUNT 0
---
LOOP_COUNT 1
---
LOOP_COUNT 2
---
LOOP_COUNT 3
CLIENTMSG: 'client->server 0'
---
LOOP_COUNT 4
---
LOOP_COUNT 5
---
LOOP_COUNT 6
---
LOOP_COUNT 7
---
LOOP_COUNT 8
---
LOOP_COUNT 9
---
LOOP_COUNT 10
---
LOOP_COUNT 11
---

Summary resolution

If I wrote this test code correctly and the server shouldn't pause, why is the TCP server intermittently pausing while it polls the client's connection for data?

Answering my own question. My blocking problem was caused by calling select() with a non-zero timeout.

When I changed select() to use a zero-second timeout, I got expected results.
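
Here is a minimal sketch of the difference, separate from my actual scripts. It uses socket.socketpair() as a stand-in for a connected client, and the helper name timed_select is made up for illustration. It is written so that it should run under both Python 2.7 and Python 3.

```python
import select
import socket
import time

def timed_select(sock, timeout):
    """Ask select() whether sock is ready-for-read; return (ready, seconds spent)."""
    start = time.time()
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable), time.time() - start

# socketpair() stands in for a connected client/server pair (assumption for the demo)
server_end, client_end = socket.socketpair()

# Nothing has been sent yet, so server_end is not ready-for-read.
ready_poll, spent_poll = timed_select(server_end, 0)    # zero timeout: returns at once
ready_wait, spent_wait = timed_select(server_end, 0.3)  # non-zero: blocks up to 0.3 s

print("timeout=0   ready=%s" % ready_poll)
print("timeout=0.3 ready=%s" % ready_wait)

server_end.close()
client_end.close()
```

The trade-off is that a zero timeout turns the server loop into a busy poll: the loop counter keeps advancing, but the loop also keeps a CPU core busy while there is nothing to do.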

Final non-blocking code (incorporating suggestions in answers):

See my answer below

Original Question Code:

tcp_server.py

#!/usr/bin/python -u
from socket import AF_INET, SOCK_STREAM, SO_REUSEADDR, SOL_SOCKET
#from socket import MSG_OOB  <--- for send()
from socket import socket
import socket as socket_module
import select
import fcntl
import os

host = ''
port = 9997

serv_sock = socket(AF_INET, SOCK_STREAM)
serv_sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)  # SO_REUSEADDR (not SOCK_STREAM) is the option name
serv_sock.bind((host, port))
serv_sock.listen(5)

fcntl.fcntl(serv_sock, fcntl.F_SETFL, os.O_NONBLOCK)  # Make the socket non-blocking

sock_list = [serv_sock]

from_client_str = '__DEFAULT__'

to_client_idx = 0
loop_count = 0
while True:
    recv_ready_list, send_ready_list, exception_ready = select.select(sock_list, sock_list,
        [], 5)

    print "---"
    print "LOOP_COUNT",  loop_count

    ## Read all sockets which are input-ready... might be client or server...
    for sock_fd in recv_ready_list:

        # accept() if we're reading on the server socket...
        if sock_fd is serv_sock:
            clientsock, clientaddr = sock_fd.accept()
            sock_list.append(clientsock)

        # read input from the client socket...
        else:
            try:
                from_client_str = sock_fd.recv(4096)
                if from_client_str=='':
                    # Client closed the socket...
                    print "CLIENT CLOSED SOCKET"
                    sock_list.remove(sock_fd)
            except socket_module.error, e:
                print "WARNING RECV FAIL"


            print "from_client_str: '{0}'".format(from_client_str)

    for sock_fd in send_ready_list:
        if sock_fd is not serv_sock:
            try:
                to_client_str = "server->client: {0}\n".format(to_client_idx)
                sock_fd.send(to_client_str)
                to_client_idx += 1
            except socket_module.error, e:
                print "TO CLIENT SEND ERROR", e

    loop_count += 1

tcp_client.py

#!/usr/bin/python -u
    
from socket import AF_INET, SOCK_STREAM
from socket import gethostname, socket
import socket as socket_module
import select
import fcntl
import errno
import time
import sys
import os

## NOTE: Using this script to simulate a scheduler
SERVER_HOSTNAME = 'myHostname'
SERVER_DOMAIN = 'mydomain.local'
PORT = 9997

def handle_socket_error_continue(e):
    ## non-blocking socket info from:
    ## https://stackoverflow.com/a/16745561/667301
    print "HANDLE_SOCKET_ERROR_CONTINUE"
    err = e.args[0]
    if (err==errno.EAGAIN) or (err==errno.EWOULDBLOCK):
        print 'CLIENT DEBUG: No data input from server'
        return True
    else:
        print 'FROM SERVER RECV ERROR: {0}'.format(e)
        sys.exit(1)

c2s = socket(AF_INET, SOCK_STREAM) # Client to server socket...
c2s.connect(('.'.join((SERVER_HOSTNAME, SERVER_DOMAIN,)), PORT))
# Set socket non-blocking...
fcntl.fcntl(c2s, fcntl.F_SETFL, os.O_NONBLOCK)

to_srv_idx = 0
while True:
    socket_list = [c2s]

    # Get the list sockets which can: take input, output, etc...
    recv_ready_list, send_ready_list, exception_ready = select.select(
        socket_list, socket_list, [])

    for sock_fd in recv_ready_list:
        assert sock_fd is c2s, "Strange socket failure here"

        #incoming message from remote server
        try:
            from_srv_str = sock_fd.recv(4096)
        except socket_module.error, e:
            ## https://stackoverflow.com/a/16745561/667301
            err_continue = handle_socket_error_continue(e)
            if err_continue is True:
                continue
        else:
            if len(from_srv_str)==0:
                print "SERVER CLOSED NORMALLY"
                sys.exit(0)

        ## NOTE: if we get this far, we successfully received from_srv_str.
        ##    Anything caught above, is some kind of fail...
        print "from_srv_str: {0}".format(from_srv_str)

    for sock_fd in send_ready_list:
        # outgoing message to the remote server
        if sock_fd is c2s:
            #to_srv_str = raw_input('Send to server: ')
            try:
                to_srv_str = 'client->server {0}'.format(to_srv_idx)
                sock_fd.send(to_srv_str)

                               ##
                time.sleep(1)  ## Client blocks the server here... Why????
                               ##

                to_srv_idx += 1
            except socket_module.error, e:
                print "TO SERVER SEND ERROR", e

Solution

"However, when I put a time.sleep(1) statement after the client does an fd.send() to the server, the TCP server code intermittently pauses while the client is sleeping."

AFAICT from running the provided code (nice self-contained example, btw), the server is behaving as intended.

In particular, the semantics of the select() call are that select() shouldn't return until there is something for the thread to do. Having the thread block inside select() is a good thing when there is nothing that the thread can do right now anyway, since it prevents the thread from spinning the CPU for no reason.
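
As a rough illustration of those semantics (a sketch, not your server code): below, socket.socketpair() stands in for the client connection, and a threading.Timer plays the role of a client that sends its next message 0.2 seconds from now. select() is given the same 5-second timeout as your server, but it should return as soon as the data arrives rather than waiting out the full timeout.

```python
import select
import socket
import threading
import time

a, b = socket.socketpair()   # stand-in for a connected client/server pair

# Simulate a client that sends its next message 0.2 seconds from now.
timer = threading.Timer(0.2, lambda: b.send(b"hello"))
timer.start()

start = time.time()
readable, _, _ = select.select([a], [], [], 5)   # same 5-second timeout as the server
elapsed = time.time() - start
timer.join()

data = a.recv(4096)
print("select() returned after about %.1f seconds" % elapsed)

a.close()
b.close()
```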

So in this case, your server program has told select() that it wants select() to return only when at least one of the following conditions is true:

1. serv_sock is ready-for-read (which is to say, a new client wants to connect to the server now)
2. serv_sock is ready-for-write (I don't believe this ever actually happens on a listening socket, so this criterion can probably be ignored)
3. clientsock is ready-for-read (that is, the client has sent some bytes to the server and they are waiting in clientsock's buffer for the server thread to recv() them)
4. clientsock is ready-for-write (that is, clientsock has some room in its outgoing-data buffer that the server could send() data into if it wants to send some data back to the client)
5. Five seconds have passed since the call to select() started blocking.

I see (via print-debugging) that when your server program blocks, it is blocking inside select(), which indicates that none of the 5 conditions above are being met during the blocking-period.

Why is that? Well, let's go down the list.

1. Not met because no other clients are trying to connect.
2. Not met because this never happens.
3. Not met because the server has read all of the data that the connected client has sent (and since the connected client is itself sleeping, it's not sending any more data).
4. Not met because the server has filled up the outgoing-data buffer of its clientsock. Because the client program is sleeping, it only reads the data coming from the server intermittently, and the TCP layer guarantees lossless, in-order transmission; so once clientsock's outgoing-data buffer is full, clientsock won't select as ready-for-write until the client reads at least some data from its end of the connection.
5. Not met because 5 seconds haven't elapsed yet since select() started blocking.
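
The full-buffer condition is the least obvious one, so here is a small sketch of it. It makes some assumptions: it uses a local socket.socketpair() rather than a real TCP connection (buffer sizes will differ, but I'd expect the readiness behavior to be analogous), and the helper names is_writable, fill_send_buffer and drain are invented for this example.

```python
import errno
import select
import socket

WOULD_BLOCK = (errno.EAGAIN, errno.EWOULDBLOCK)

def is_writable(sock):
    """Poll (zero timeout) whether select() reports sock as ready-for-write."""
    _, writable, _ = select.select([], [sock], [], 0)
    return bool(writable)

def fill_send_buffer(sock, chunk=b"x" * 4096):
    """send() on a non-blocking socket until the kernel refuses more data."""
    total = 0
    while True:
        try:
            total += sock.send(chunk)
        except socket.error as e:
            if e.args[0] in WOULD_BLOCK:
                return total
            raise

def drain(sock):
    """recv() on a non-blocking socket until nothing is left to read."""
    total = 0
    while True:
        try:
            data = sock.recv(65536)
        except socket.error as e:
            if e.args[0] in WOULD_BLOCK:
                return total
            raise
        if not data:
            return total
        total += len(data)

server_side, client_side = socket.socketpair()
server_side.setblocking(False)
client_side.setblocking(False)

before = is_writable(server_side)      # empty buffer: ready-for-write
sent = fill_send_buffer(server_side)   # the peer is "asleep" and reads nothing
while_full = is_writable(server_side)  # buffer full: no longer ready-for-write
received = drain(client_side)          # the peer wakes up and reads everything
after = is_writable(server_side)       # room again: ready-for-write
```

Once while_full is False and the client has nothing new to say, the only thing left that can wake the server's select() is the timeout, which matches the pauses you are seeing.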

So is this behavior actually a problem for the server? In fact it is not, because the server will still be responsive to any other clients that connect to the server. In particular, select() will still return right away whenever serv_sock or any other client's socket select()s as ready-for-read (or ready-for-write) and so the server can handle the other clients just fine while waiting for your hacked/slow client to wake up.

The hacked/slow client might be a problem for the user, but there's nothing the server can really do about that (short of forcibly disconnecting the client's TCP connection, or maybe printing out a log message requesting that someone debug the connected client program, I suppose :)).

I agree with EJP, btw -- selecting on ready-for-write should only be done on sockets that you actually want to write some data to. If you don't actually have any desire to write to the socket ASAP, then it's pointless and counterproductive to instruct select() to return as soon as that socket is ready-for-write: the problem with doing so is that you're likely to spin the CPU a lot whenever any socket's outgoing-data-buffer is less-than-full (which in most applications, is most of the time!). The user-visible symptom of the problem would be that your server program is using up 100% of a CPU core even when it ought to be idle or mostly-idle.
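
One way to follow that advice, sketched under some assumptions: keep a per-socket queue of unsent bytes and only pass a socket in select()'s write list while its queue is non-empty. The pending dict and the build_select_lists helper are hypothetical names for this example, and socket.socketpair() again stands in for a real client connection.

```python
import select
import socket

def build_select_lists(sock_list, pending):
    """Return (read_list, write_list) for select().

    pending is a hypothetical dict mapping socket -> bytes still waiting to be sent.
    Every socket is watched for reads; only sockets with queued data are watched
    for writes.
    """
    read_list = list(sock_list)
    write_list = [s for s in sock_list if pending.get(s)]
    return read_list, write_list

a, b = socket.socketpair()
pending = {}

# Nothing queued: select() is not asked about writability, so an idle,
# writable socket does not wake the loop.
rlist, wlist = build_select_lists([a], pending)
_, writable_idle, _ = select.select(rlist, wlist, [], 0.1)

# Queue a message: now the socket is included in the write list.
pending[a] = b"server->client: 0\n"
rlist, wlist = build_select_lists([a], pending)
_, writable_busy, _ = select.select(rlist, wlist, [], 0.1)
for s in writable_busy:
    sent = s.send(pending[s])
    pending[s] = pending[s][sent:]   # keep whatever did not fit for the next pass
```

With this arrangement the loop can block in select() without a short timeout and still wake promptly for incoming data, new connections, or a socket that has room for data you actually want to send.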