
Parse HTTP responses from a TCP stream


TCP is not a message-based protocol; it's a simple stream of bytes. The HTTP protocol, however, is a message-based protocol layered over TCP. How, then, would one go about parsing raw HTTP data from a TCP stream connection?

For example, we connect to a proxy server via a TCP socket in Python:

import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((host, port))  # host and port are proxy's address

Then we ask the proxy whether we can CONNECT to a target host (google.com, for example) through it:

request = b'CONNECT %s:%d HTTP/1.0\r\n\r\n' % (b'google.com', 443)
s.sendall(request)

Then we need to receive data from the socket. But how? When we recv data, we save it into a buffer, like so:

buffer = s.recv(1024)

I've checked that when the host closes the connection, it sends a zero-byte-long message (for example, with 404, 502 or 400 status codes). But when the connection stays alive (the host returned status code 200), it doesn't send the terminating zero bytes. Of course, it shouldn't, but then how do we know that this is the end of the message?

What I've made of the HTTP protocol is that headers are separated by \r\n and the body is separated from the headers by \r\n\r\n. An HTTP message always ends with \r\n. So, theoretically, we could just read the message until we meet \r\n\r\n; then we know that the rest of the message, up to another \r\n, is the body of the response.

But what if some joker server decides to put another \r\n inside the HTTP response body? Then the whole parsing breaks! Now the algorithm thinks the body is over and that the rest of the message is the next message's headers, and it throws an exception trying to parse it! And what if some fun guy writes a server which puts an \r\n\r\n inside a custom response header?

How do we go about parsing from a raw socket, then? How is it done right? And how do we avoid slipping up on some misconfigured server's responses?


Solution

  • That's not a very precise description of HTTP. Although the protocol certainly has its flaws, it is much more robust than your summary indicates, as demonstrated by the enormous amount of data successfully transmitted.

    Of course, successful transmission requires that the server correctly implements the protocol. Server bugs will make correct reception of a message impossible. If a server were to send an extra CR-LF in the header, for example, the client would assume that what follows is the message body, which is likely to result in some kind of failure. However, the body of the message is not so sensitive: any arbitrary stream of bytes, including arbitrary line endings and even NUL bytes, can be transmitted over HTTP.

    There are three mechanisms used to packetise the body. In the original HTTP specification, the body simply extends up to the point that the TCP connection is closed by the server, so a single TCP connection could only serve a single HTTP response.

    By the way, servers don't send zero-length messages before closing a connection. There's no way to do that because, as you point out, TCP is just a stream of bytes. It's not a message-based protocol at all; so there's no possibility to send a message of any length, including zero.

    The zero-length return from read() is fabricated by the standard library on the receiving side in order to communicate to the caller of read() that there is no more data; in other words, that the connection has been closed by the other end. This is identical to the way that read() from a file signifies that the end of the file has been reached. When you read() from a file and receive zero bytes, that is not because there is some "zero-length packet" in the file. Like a TCP stream, a file is just a sequence of undifferentiated bytes without message markers.
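    That end-of-stream behaviour can be sketched in Python. The loop below (a minimal sketch; a local socketpair stands in for a real server, and the function name is my own) keeps calling recv() until it returns b'', which signals that the peer has closed its end:

```python
import socket

def recv_until_closed(sock, bufsize=1024):
    """Accumulate bytes from a socket until the peer closes the
    connection, which recv() signals by returning b''."""
    chunks = []
    while True:
        data = sock.recv(bufsize)
        if not data:          # b'' means the peer closed its end
            break
        chunks.append(data)
    return b''.join(chunks)

# Demonstrate with a local socket pair instead of a real server.
a, b = socket.socketpair()
a.sendall(b'HTTP/1.0 200 OK\r\n\r\nhello')
a.close()                     # closing produces the zero-length read
body = recv_until_closed(b)
b.close()
```

    This is exactly the "original HTTP" framing described above: the body is everything received up to the point the server closes the connection.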

    But to get back to HTTP. Since nothing stops a client from opening an arbitrary number of connections to a single server, the original "one connection, one request" communication protocol was workable. But there is considerable overhead in opening and tearing down a TCP connection, and many HTTP messages are very short. So that wasn't very scalable, and the next HTTP version had to include a mechanism to send multiple messages over a single TCP connection (persistent connections, over which requests can also be "pipelined").

    However, having a lot of dormant open TCP connections also creates undesirable overhead for the server. So it is still allowed to close a connection at any time; if the client then wants to make a new request, it must open a new connection.

    Both client requests and server responses consist of a header possibly followed by a body. The existence of the body depends on the contents of the header, and correct functioning of pipelined transmission requires server and client to agree about whether a particular message header will be followed by a body or not. An unexpected body will be interpreted on the other side as a new header, which will probably be ill-formed.
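    As a rough sketch of that header-then-body structure, here is a minimal Python parser (the function name is my own, and the header handling is deliberately simplified: real headers may repeat, be folded, or carry extensions) that splits a buffered message at the blank line ending the header:

```python
def parse_headers(raw: bytes):
    """Split a raw HTTP message into (start_line, headers, rest).

    `raw` must already contain the blank line (CRLF CRLF) that ends
    the header; anything after it is the (possibly partial) body.
    """
    head, _, rest = raw.partition(b'\r\n\r\n')
    lines = head.split(b'\r\n')
    start_line = lines[0].decode('ascii')
    headers = {}
    for line in lines[1:]:
        # Header names are case-insensitive; normalize to lowercase.
        name, _, value = line.partition(b':')
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, rest

msg = b'HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello'
start, hdrs, rest = parse_headers(msg)
```

    The parsed headers are what tell the receiver whether a body follows and how long it is, which is the agreement the paragraph above describes.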

    There are two ways for the sender to describe the extent of the body of a message. The most straightforward is to simply include a header containing the precise length in bytes of the body. (A "Content-Length" header.) After the header is done (signalled by two consecutive CRLF sequences), the next number of bytes indicated by the content length header are taken as the body, without needing to look at the bytes. (If the server injects extra bytes not counted in the declared content length, that will cause a parsing error at the other end, and the same if it leaves bytes out. But there is no problem with the message including any number of consecutive CRLFs.)

    Once the body has been fully sent, the next message, if any, begins immediately on the same connection; alternatively, the sender can close its side of the connection. If the host on the other end gets tired of waiting, it can also close the connection.
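    Reading a Content-Length body therefore means reading exactly that many bytes, looping because a single recv() call may return fewer bytes than requested. A minimal sketch (recv_exact is a hypothetical helper name, and the socketpair merely stands in for a real connection):

```python
import socket

def recv_exact(sock, n):
    """Read exactly n bytes from sock, looping because a single
    recv() call may return fewer bytes than requested."""
    parts = []
    remaining = n
    while remaining:
        data = sock.recv(remaining)
        if not data:
            raise ConnectionError('peer closed the connection mid-body')
        parts.append(data)
        remaining -= len(data)
    return b''.join(parts)

# Simulate a server that declared "Content-Length: 11".
a, b = socket.socketpair()
a.sendall(b'hello world')
payload = recv_exact(b, 11)
a.close(); b.close()
```

    Because the byte count comes from the header, the body can contain any number of CRLFs without confusing the parser.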

    Sending a Content-Length header is easy if the sender knows the length of the content (for example, if the content is a file), but often message bodies are generated dynamically and their complete length isn't known until the entire message has been generated, which might take a long time. So another mechanism was needed to cover that use case: so-called "chunking".

    In a chunked message, the body is divided by the sender into arbitrary-length chunks. Each chunk starts with its length in hexadecimal, followed by a CRLF; the chunk data comes next, terminated by another CRLF. The sender indicates that the message has been completely sent by transmitting a zero-length chunk (that is, a line containing the character 0 followed by a CRLF, and then a final empty line).

    Chunking lets the sender send dynamically-generated messages of unknown length. All that it needs to do is to accumulate some bytes of the message and send those as a chunk. It can accumulate a fixed-length buffer, or for a fixed amount of time, or using any other criterion. (Some embedded libraries just turn each send() call into a single, often very small, chunk. That's fine, too. Chunking has no semantic function; the end of a chunk can be anywhere, even in the middle of a multibyte UTF-8 sequence.)
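    A toy decoder for a fully buffered chunked body might look like this in Python (a sketch only, with a name of my choosing; it ignores the chunk extensions and trailers that real messages may include):

```python
def decode_chunked(raw: bytes) -> bytes:
    """Decode an in-memory chunked body.

    Each chunk is '<hex length>\r\n<data>\r\n'; a chunk of
    length 0 terminates the body.
    """
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.index(b'\r\n', pos)
        size = int(raw[pos:line_end], 16)  # length is hexadecimal
        pos = line_end + 2
        if size == 0:
            break                          # terminating zero chunk
        body += raw[pos:pos + size]
        pos += size + 2                    # skip data and its CRLF
    return bytes(body)

encoded = b'5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n'
decoded = decode_chunked(encoded)
```

    Note that the chunk sizes, not delimiter scanning, drive the parse, which is why arbitrary CRLFs inside the data cause no trouble.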

    That's a very condensed overview of how HTTP allows multiple messages to be sent; I've left out a lot of details. You should consult the actual protocol specification if you want to write an implementation.