Tags: c#, https, .net-core, tcpclient, sslstream

Content-Length can't be trusted when reading a response from SslStream?


Playing with TcpClient and NetworkStream on .NET Core 2.2.
Trying to get the content from https://www.google.com/

Before I continue, I'd like to make clear that I do NOT want to use WebClient, HttpWebRequest or HttpClient classes. There are a lot of questions where people had encountered some problems using TcpClient and where responders or commenters have suggested the use of something else for this task, so please don't.

Let's say we have an instance of SslStream obtained from TcpClient's NetworkStream and properly authenticated.

Let's say that we also have one StreamWriter that we use to write HTTP messages to this stream and one StreamReader that we use to read HTTP message headers from the response:

var tcpClient = new TcpClient("google.com", 443);
var stream = tcpClient.GetStream();
var sslStream = new SslStream(stream, false);
sslStream.AuthenticateAsClient("google.com");
var streamWriter = new StreamWriter(sslStream);
var streamReader = new StreamReader(sslStream);

Say we send a request in the same way as a Firefox browser would have sent one:

GET / HTTP/1.1
Host: www.google.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: sr,sr-RS;q=0.8,sr-CS;q=0.6,en-US;q=0.4,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Cache-Control: max-age=0

Which causes the following response to be sent:

HTTP/1.1 200 OK
Date: Sun, 28 Apr 2019 17:28:27 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=UTF-8
Strict-Transport-Security: max-age=31536000
P3P: CP="This is not a P3P policy! See g.co/p3phelp for more info."
Content-Encoding: br
Server: gws
Content-Length: 55786
... etc

Now, after reading all response headers using streamReader.ReadLine() and parsing the content length found in the response header, let's read the response content into a buffer:

var totalBytesRead = 0;
int bytesRead;
var buffer = new byte[contentLength];
do
{
    bytesRead = sslStream.Read(buffer,
        totalBytesRead,
        contentLength - totalBytesRead);
    totalBytesRead += bytesRead;
} while (totalBytesRead < contentLength && bytesRead > 0);

However, the last call to Read in this do..while loop hangs, and the loop only exits after the remote server closes the connection. That suggests we had already read the entire response content and the server was simply waiting for another HTTP message on this stream. Is contentLength incorrect? Or does streamReader read too much when calling ReadLine, advancing the SslStream position so that the wrong data is read afterwards?

What gives? Has anyone had experience with this?

P.S. Here is a sample console app code with all safety checks omitted which demonstrates this:

private static void Main(string[] args)
{
    using (var tcpClient = new TcpClient("google.com", 443))
    {
        var stream = tcpClient.GetStream();
        using (var sslStream = new SslStream(stream, false))
        {
            sslStream.AuthenticateAsClient("google.com");
            using (var streamReader = new StreamReader(sslStream))
            using (var streamWriter = new StreamWriter(sslStream))
            {
                streamWriter.WriteLine("GET / HTTP/1.1");
                streamWriter.WriteLine("Host: www.google.com");
                streamWriter.WriteLine("User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0");
                streamWriter.WriteLine("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
                streamWriter.WriteLine("Accept-Language: sr,sr-RS;q=0.8,sr-CS;q=0.6,en-US;q=0.4,en;q=0.2");
                streamWriter.WriteLine("Accept-Encoding: gzip, deflate, br");
                streamWriter.WriteLine("Connection: keep-alive");
                streamWriter.WriteLine("Upgrade-Insecure-Requests: 1");
                streamWriter.WriteLine("Cache-Control: max-age=0");
                streamWriter.WriteLine();
                streamWriter.Flush();

                var lines = new List<string>();
                var line = streamReader.ReadLine();
                var contentLength = 0;
                while (!string.IsNullOrWhiteSpace(line))
                {
                    var split = line.Split(": ");
                    if (split.First() == "Content-Length")
                    {
                        contentLength = int.Parse(split[1]);
                    }

                    lines.Add(line);
                    line = streamReader.ReadLine();
                }

                var totalBytesRead = 0;
                int bytesRead;
                var buffer = new byte[contentLength];
                do
                {
                    bytesRead = sslStream.Read(buffer,
                        totalBytesRead,
                        contentLength - totalBytesRead);
                    totalBytesRead += bytesRead;
                    Console.WriteLine(
                        $"Bytes read: {totalBytesRead} of {contentLength} (last chunk: {bytesRead} bytes)");
                } while (totalBytesRead < contentLength && bytesRead > 0);

                Console.WriteLine(
                    "--------------------");
            }
        }
    }

    Console.ReadLine();
}

EDIT

This always happens after I submit a question. I've been scratching my head for a couple of days without being able to find the cause of the problem, but as soon as I submitted it, I knew it was something to do with StreamReader messing things up when trying to read a line.
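
The read-ahead behaviour can be reproduced without a network connection. The following is a minimal sketch, assuming a MemoryStream as a stand-in for the SslStream and a hand-written response; ReadAheadDemo and BytesConsumedByReadLine are hypothetical names used only for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class ReadAheadDemo
{
    // Returns how many bytes StreamReader pulled from the underlying stream
    // in order to return a single line.
    public static long BytesConsumedByReadLine(Stream stream)
    {
        var before = stream.Position;

        // leaveOpen: true keeps the stream usable after the reader is disposed
        using (var reader = new StreamReader(stream, Encoding.ASCII, false, 1024, true))
        {
            reader.ReadLine();
        }

        return stream.Position - before;
    }

    public static void Main()
    {
        // Stand-in for an HTTP response: a short header followed by a 10-byte body
        var response = "HTTP/1.1 200 OK\r\nContent-Length: 10\r\n\r\n0123456789";
        var stream = new MemoryStream(Encoding.ASCII.GetBytes(response));

        var consumed = BytesConsumedByReadLine(stream);

        // The status line is only 17 bytes long including \r\n, but the reader
        // fills its internal buffer with whatever the stream has available.
        Console.WriteLine($"ReadLine consumed {consumed} of {stream.Length} bytes");
    }
}
```

Any bytes the reader has buffered beyond the line it returned are no longer available to a later Read call on the underlying stream, which is consistent with the body loop above waiting for bytes that never arrive.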

So if I stop using the StreamReader and replace calls to ReadLine with something that reads byte-by-byte, everything seems to be fine. The replacement code can be written as the following:

private static IEnumerable<string> ReadHeader(Stream sslStream)
{
    // One-byte buffer for reading bytes from the stream
    var buffer = new byte[1];

    // Initialize a four-character string to keep the last four bytes of the message
    var check = new StringBuilder("....");
    int bytes;
    var responseBuilder = new StringBuilder();
    do
    {
        // Read the next byte from the stream and write it into the buffer
        bytes = sslStream.Read(buffer, 0, 1);
        if (bytes == 0)
        {
            // If nothing was read, break the loop
            break;
        }

        // Add the received byte to the response builder.
        // We expect the header to be ASCII encoded so it's OK to just cast to char and append
        responseBuilder.Append((char) buffer[0]);

        // Always remove the first char from the string and append the latest received one
        check.Remove(0, 1);
        check.Append((char) buffer[0]);

        // \r\n\r\n marks the end of the message header, so break here
        if (check.ToString() == "\r\n\r\n")
        {
            break;
        }
    } while (bytes > 0);

    var headerText = responseBuilder.ToString();
    return headerText.Split("\r\n", StringSplitOptions.RemoveEmptyEntries);
}

...which would then make our sample console app look like this:

private static void Main(string[] args)
{
    using (var tcpClient = new TcpClient("google.com", 443))
    {
        var stream = tcpClient.GetStream();
        using (var sslStream = new SslStream(stream, false))
        {
            sslStream.AuthenticateAsClient("google.com");
            using (var streamWriter = new StreamWriter(sslStream))
            {
                streamWriter.WriteLine("GET / HTTP/1.1");
                streamWriter.WriteLine("Host: www.google.com");
                streamWriter.WriteLine("User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0");
                streamWriter.WriteLine("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
                streamWriter.WriteLine("Accept-Language: sr,sr-RS;q=0.8,sr-CS;q=0.6,en-US;q=0.4,en;q=0.2");
                streamWriter.WriteLine("Accept-Encoding: gzip, deflate, br");
                streamWriter.WriteLine("Connection: keep-alive");
                streamWriter.WriteLine("Upgrade-Insecure-Requests: 1");
                streamWriter.WriteLine("Cache-Control: max-age=0");
                streamWriter.WriteLine();
                streamWriter.Flush();

                var lines = ReadHeader(sslStream);
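                // Header names are case-insensitive, so compare accordingly
                var contentLengthLine = lines.First(x => x.StartsWith("Content-Length:", StringComparison.OrdinalIgnoreCase));
                var split = contentLengthLine.Split(':');
                var contentLength = int.Parse(split[1].Trim());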

                var totalBytesRead = 0;
                int bytesRead;
                var buffer = new byte[contentLength];
                do
                {
                    bytesRead = sslStream.Read(buffer,
                        totalBytesRead,
                        contentLength - totalBytesRead);
                    totalBytesRead += bytesRead;
                    Console.WriteLine(
                        $"Bytes read: {totalBytesRead} of {contentLength} (last chunk: {bytesRead} bytes)");
                } while (totalBytesRead < contentLength && bytesRead > 0);

                Console.WriteLine(
                    "--------------------");
            }
        }
    }

    Console.ReadLine();
}

private static IEnumerable<string> ReadHeader(Stream sslStream)
{
    // One-byte buffer for reading bytes from the stream
    var buffer = new byte[1];

    // Initialize a four-character string to keep the last four bytes of the message
    var check = new StringBuilder("....");
    int bytes;
    var responseBuilder = new StringBuilder();
    do
    {
        // Read the next byte from the stream and write it into the buffer
        bytes = sslStream.Read(buffer, 0, 1);
        if (bytes == 0)
        {
            // If nothing was read, break the loop
            break;
        }

        // Add the received byte to the response builder.
        // We expect the header to be ASCII encoded so it's OK to just cast to char and append
        responseBuilder.Append((char)buffer[0]);

        // Always remove the first char from the string and append the latest received one
        check.Remove(0, 1);
        check.Append((char)buffer[0]);

        // \r\n\r\n marks the end of the message header, so break here
        if (check.ToString() == "\r\n\r\n")
        {
            break;
        }
    } while (bytes > 0);

    var headerText = responseBuilder.ToString();
    return headerText.Split("\r\n", StringSplitOptions.RemoveEmptyEntries);
}
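
The revised approach can also be tried offline by swapping the SslStream for a MemoryStream. This is only a sketch under stated assumptions: the response is hand-written, ReadHeaderText is a condensed stand-in for the ReadHeader method above, and ReadBody is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class HeaderThenBodyDemo
{
    // Condensed stand-in for ReadHeader: reads one byte at a time until \r\n\r\n
    public static string ReadHeaderText(Stream stream)
    {
        const string boundary = "\r\n\r\n";
        var buffer = new byte[1];
        var text = new StringBuilder();
        while (stream.Read(buffer, 0, 1) == 1)
        {
            text.Append((char)buffer[0]);
            if (text.Length >= boundary.Length &&
                text.ToString(text.Length - boundary.Length, boundary.Length) == boundary)
            {
                break;
            }
        }
        return text.ToString();
    }

    // Hypothetical helper: reads the header, then exactly Content-Length body bytes
    public static byte[] ReadBody(Stream stream)
    {
        var lengthLine = ReadHeaderText(stream)
            .Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries)
            .First(x => x.StartsWith("Content-Length:", StringComparison.OrdinalIgnoreCase));
        var contentLength = int.Parse(lengthLine.Substring("Content-Length:".Length).Trim());

        var body = new byte[contentLength];
        var total = 0;
        while (total < contentLength)
        {
            var read = stream.Read(body, total, contentLength - total);
            if (read == 0)
            {
                throw new EndOfStreamException("Stream ended before the full body was read.");
            }
            total += read;
        }
        return body;
    }

    public static void Main()
    {
        // The two trailing bytes stand in for the start of a following message
        var raw = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello??";
        var stream = new MemoryStream(Encoding.ASCII.GetBytes(raw));

        var body = ReadBody(stream);
        Console.WriteLine(Encoding.ASCII.GetString(body));
        Console.WriteLine($"Bytes left unread: {stream.Length - stream.Position}");
    }
}
```

Because the header is consumed one byte at a time, nothing past \r\n\r\n is taken from the stream, so the body loop reads exactly the five body bytes and leaves the two trailing bytes untouched.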

Solution

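  • The answer to the question in the title is that Content-Length CAN be trusted,
    as long as you read the message header properly, i.e. without StreamReader.ReadLine. StreamReader reads ahead from the underlying stream in buffer-sized chunks, so by the time ReadLine returns the last header line it has usually already consumed part of the response body.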

    Here is a utility method which does the job:

    private static string ReadStreamUntil(Stream stream, string boundary)
    {
        // One-byte buffer for reading bytes from the stream
        var buffer = new byte[1];
    
        // Initialize a string builder with placeholder chars, one per character of the boundary
        var boundaryPlaceholder = new string('.', boundary.Length);
        var check = new StringBuilder(boundaryPlaceholder);
        var responseBuilder = new StringBuilder();
        do
        {
            // Read the next byte from the stream and write it into the buffer
            var byteCount = stream.Read(buffer, 0, 1);
            if (byteCount == 0)
            {
                // If nothing was read, break the loop
                break;
            }
    
            // Add the received byte to the response builder.
            responseBuilder.Append((char)buffer[0]);
    
            // Always remove the first char from the string and append the latest received one
            check.Remove(0, 1);
            check.Append((char)buffer[0]);
    
            // boundary marks the end of the message, so break here
        } while (check.ToString() != boundary);
    
        return responseBuilder.ToString();
    }
    

    Then, to read the header, we can just call ReadStreamUntil(sslStream, "\r\n\r\n").

    The key here is to read the stream byte by byte until a known byte sequence (in this case \r\n\r\n) is encountered.

    After it's been read by using this method, the stream will be at the correct position for the response content to be read properly.

    If useful, this method can easily be converted to an async variant by awaiting ReadAsync instead of calling Read.

    It's worth noting that the above method only works correctly if the text is ASCII encoded, because each byte is cast directly to a char.
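
The async variant mentioned above might look like the following. It is a sketch rather than a tested production implementation: ReadStreamUntilAsync and AsyncHeaderReader are assumed names, and a MemoryStream stands in for the SslStream so the example runs offline.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class AsyncHeaderReader
{
    // Same logic as ReadStreamUntil, with ReadAsync awaited instead of Read
    public static async Task<string> ReadStreamUntilAsync(Stream stream, string boundary)
    {
        var buffer = new byte[1];
        var check = new StringBuilder(new string('.', boundary.Length));
        var responseBuilder = new StringBuilder();
        do
        {
            var byteCount = await stream.ReadAsync(buffer, 0, 1).ConfigureAwait(false);
            if (byteCount == 0)
            {
                // End of stream reached before the boundary was found
                break;
            }

            responseBuilder.Append((char)buffer[0]);

            // Slide the window: drop the oldest char, append the newest
            check.Remove(0, 1);
            check.Append((char)buffer[0]);
        } while (check.ToString() != boundary);

        return responseBuilder.ToString();
    }

    public static void Main()
    {
        var raw = "HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhello";
        using (var stream = new MemoryStream(Encoding.ASCII.GetBytes(raw)))
        {
            // Blocking here keeps the sample independent of async Main support
            var header = ReadStreamUntilAsync(stream, "\r\n\r\n").GetAwaiter().GetResult();
            Console.WriteLine(header.TrimEnd());
            Console.WriteLine($"Stream position after header: {stream.Position}");
        }
    }
}
```

Awaiting a one-byte read per iteration keeps the stream position exact, at the cost of one asynchronous call per header byte.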