Parsing HTTP body in Java

What I am trying to achieve:

I am currently developing a proxy where the goal is to alter the body received in the server's response. (The proxy shall only support HTTP, not HTTPS).

As an example of what I want the proxy to accomplish:

The client (browser) sends an HTTP GET request to the proxy, where it is parsed and forwarded to the correct host. The host (server) then responds with a 200 OK and the HTML file. The HTML file is parsed in the proxy and altered. The proxy then changes the Content-Length and other headers if necessary and sends the response back to the client. The client will now see an altered version of the HTML file that the proxy received from the server.

The issue:

The proxy seems to have an issue with UTF-8 and other encodings where the font can't recognize certain characters. What happens is that when I read using a Socket's InputStream, it times out because it believes it has not read enough bytes (according to the Content-Length). When the HTML file is returned to the browser, a lot of "diamonds with a question mark inside" appear, which, according to my research, is what is shown when the font can't load a character. It can vary between fonts.

It works fine on websites that don't have "weird characters". When reading the body it stops before reading the entire body. For example: In one case I had a body that contained 179643 bytes, and it stopped reading when my bodyLength had the value of ~3000 bytes. It then timed out, causing a 5 sec delay between the server and client. The content was all correct, it is just not calculated the correct way in the while loop.

I have this code snippet that causes issues (This code handles the response to a Socket)

private Response getResponse(final Socket socket) {
    try {

        HashMap<String, String> headers;
        StringBuilder builder = new StringBuilder();
        BufferedReader stream = new BufferedReader(new InputStreamReader(socket.getInputStream()));

        //-- READ FIRST LINE --//
        // We assume that it is a valid response! (TODO)
        String[] firstLine = stream.readLine().split(" ",3);
        headers = getHeaders(stream);

        //--- GET BODY ---//
        String contentLength = headers.get("Content-Length");
        //Check if body exists
        if(contentLength != null) {
            int bodyLength = Integer.parseInt(contentLength);

            String s;
            //The issue occurs in this while loop!
            while(bodyLength > 0 && (s = stream.readLine()) != null) {
                bodyLength -= (s+"\n").getBytes(StandardCharsets.UTF_8).length;
                builder.append(s).append("\n");
            }

        }

        //-- Return Request --//
        int code = Integer.parseInt(firstLine[1]);
        return new Response(headers,builder, firstLine[0],code, firstLine[2]);
    }
    catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}

(NOTE: I know that parsing the entire body as Strings is not efficient and that a "real" proxy would just pass the bytes along. I am, however, according to my knowledge forced to read the Strings as I have to change the contents. I should also state that I am not allowed to use libraries!)

Above you can see that I have StandardCharsets.UTF_8. This is temporary and I have this because the page that “timed out” used UTF-8 as encoding. I am trying to make this example work before moving on and implementing a better solution.

I believe that the issue has to do with the while loop in the above code.
What this method should do is:

1. Parse the first line (e.g. GET URL PROTOCOL).
2. Get the headers from the response and put them into a HashMap object for easy use.
3. Check if there is any body; if so, enter a loop in which we do steps 4, 5 and 6 below.
4. Read the string and add it to the StringBuilder.
5. Subtract the length of the String from the bodyLength variable, which is the content length as an integer.
6. Loop while bodyLength > 0, because if it is 0 we are done.
7. When done, return the entire response as a Response object. (This class is custom and basically just contains the headers, body, etc.)

I only posted the method above as it is there the actual "encoding issue" occurs. Seeing the other methods would just make the question lose its precision in my opinion. If you wonder about other parts of the code, feel free to ask a question in the comments!

Now to the actual question:
How do I solve this issue? Strings use UTF-16 in Java so does the encoding "disappear" when I read a String from the InputStream? For example: If I in the snippet above in the beginning instead put InputStreamReader(socket.getInputStream(), "UTF-8"), then - shouldn't the Strings be UTF-8 when read from the stream? Or are they immediately converted to UTF-16 when set as a String object?

What have I tried?

I tried doing the following: InputStreamReader(socket.getInputStream(), "UTF-8"). Although this, combined with doing the same thing for the output stream, makes the "diamonds with question marks" disappear, it does not solve the timeout issue.

I tried parsing the body as bytes, but somehow this ended up not working at all. Not only this, but it would not be easy to replace the contents of the body with this approach (that I know of).

Solution

You're doing it the wrong way around. You read characters, and then convert those back to bytes in order to keep track of your Content-Length.

That's wrong - read bytes, do the 'math' on how many bytes are left, then convert THOSE. Which isn't necessarily easy - you could be reading 'half' a character.
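To make the mismatch concrete, here is a minimal sketch that replays the question's counting loop over an in-memory byte array instead of a socket. The class and method names are made up for illustration, and which of the two effects hit the page in the question is an assumption; the question does not confirm it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ByteCountMismatch {
    // Counts bytes the way the loop in the question does:
    // decode to characters, read a line, re-encode line + "\n".
    static int countedBytes(byte[] wire) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(wire), StandardCharsets.UTF_8))) {
            int counted = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                counted += (line + "\n").getBytes(StandardCharsets.UTF_8).length;
            }
            return counted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // CRLF line endings: readLine() drops both bytes, the loop adds back one.
        byte[] crlf = "a\r\nb\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(crlf.length + " bytes sent, " + countedBytes(crlf) + " counted");
        // prints "6 bytes sent, 4 counted"

        // ISO-8859-1 body decoded as UTF-8: the lone 0xE9 byte becomes U+FFFD,
        // which re-encodes to three bytes.
        byte[] latin1 = "caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length + " bytes sent, " + countedBytes(latin1) + " counted");
        // prints "5 bytes sent, 7 counted"
    }
}
```

In the first case the loop undercounts, so bodyLength never reaches 0 and the next readLine() blocks until the socket times out. In the second case it overcounts. Either way, the number derived from the decoded characters no longer matches Content-Length.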

More generally there are libraries to do this for you. HTTP is surprisingly complicated, it's bizarre to want to write an entire web server, especially when you are still working with the level of experience that evidently isn't sufficient to realize basic mistakes like this. It's not your fault; HTTP seems really simple, so simple, you thought: Heck, I'll give it a shot. But, don't do that.

One of the complex aspects of HTTP is that it's a mixed-mode protocol: the request itself and the headers are character-based, but the content is byte-based. Note that the preamble (the headers and such) is US-ASCII, not UTF-8. This shouldn't ordinarily matter (if truly everything is sent in ASCII, a UTF-8 parser will read it just the same), but it does if the input is invalid. I can tell you some extremely deep-down-the-rabbit-hole stories about how accepting things that other servers do not accept leads to security issues, so, don't do that.
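One way to avoid silently accepting invalid header bytes is to decode with a CharsetDecoder that reports errors instead of substituting a replacement character. The following is a small sketch of that idea; the class name, the method name, and the choice to return null on invalid input are illustrative assumptions, not part of any existing API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictAscii {
    // Decodes the first `length` bytes as US-ASCII.
    // Returns null if any byte is outside the ASCII range.
    static String decodeHeaderLine(byte[] bytes, int length) {
        try {
            return StandardCharsets.US_ASCII.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes, 0, length))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // caller decides how to reject the message
        }
    }

    public static void main(String[] args) {
        byte[] ok = "Content-Length: 5".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodeHeaderLine(ok, ok.length));

        byte[] bad = {'X', (byte) 0xC3, (byte) 0xA9}; // contains non-ASCII bytes
        System.out.println(decodeHeaderLine(bad, bad.length)); // prints "null"
    }
}
```

By contrast, new String(bytes, StandardCharsets.US_ASCII) would quietly turn each non-ASCII byte into a replacement character, which is the kind of lenient behaviour the paragraph above warns about.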

There are ways to write it correctly; of course there are, there are plenty of HTTP servers written in Java, after all. So, why not use one of them? There's Jetty, which is very pluggable and controllable, for example, and a 100% Java solution. Just add some jars; that's all you need to do.

If you insist on doing it yourself, know that this is merely the first of about 5000 questions, and the odds that your final, working product (if you ever get that far) is truly 'good' are effectively zilch. It is virtually guaranteed that it has some security issue, probably a major one, and it is virtually guaranteed that some browser or server or some exotic combination of the two is going to fail if your proxy is in the middle of it.

If you insist, this is the strategy:

Realize that HTTP is fundamentally byte based. If you write new InputStreamReader, you lost the game.
To read the 'stringy' parts, you read data in byte form until a known end point (e.g. a newline symbol, signalling that the GET /path HTTP/1.1 line is now done), and then take the entire byte array that contains the 'line' and convert that to a string, e.g. using new String(byteArr, 0, pos, StandardCharsets.US_ASCII), and then parse that string (e.g. store it in your header map or read the HTTP method out of it).
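For the HTTP body, read exactly Content-Length bytes first, and only then decode those bytes to characters (for example by wrapping the byte array in a ByteArrayInputStream and an InputStreamReader). Keep the two steps separate: you can't convert to characters and then count how many bytes you read. It just doesn't work that way.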
ByteBuffer and Channel is the newer API and it probably will work a lot better especially if you want to efficiently deal with a 'mixed mode' channel that is going to send a ton of data.
... but, really, abort. For example, the Range mechanic built into HTTP, used to request a chunk of a resource and required, more or less, to host videos (as web video players use this continuously to stream the video file), is completely crazy and doesn't work anything like one would expect. There's chunked encoding which can be slightly odd. There are all sorts of bizarro caveats that web servers have taken care of, up to and including parsing out the User-Agent string to change behaviour (such as ignoring the indication from a browser that they can handle gzip compression when the UA says it's IE6 and the resource asked is css or js. Which IE6 can't actually read if compressed even if it says it can. Fortunately IE6 is dead and buried but it's not the only bizarro thing that just about every web server has hacked around. No, you won't find that in any spec. That's my point. The amount of domain knowledge that web server authors have is mind boggling and you will spend the next 20 years rediscovering it all if you try to write this on your own. When I said 'HTTP is actually quite complicated', perhaps now you start to see how complicated I mean).
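The byte-first strategy in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not a complete HTTP parser: the class and method names are hypothetical, it assumes a Content-Length header is present (no chunked encoding), and it does no validation of the lines it reads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class RawHttpReader {
    // Reads bytes up to the next '\n' and returns them decoded as US-ASCII,
    // without the trailing CR LF. Returns null at end of stream.
    static String readAsciiLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            line.write(b);
        }
        if (b == -1 && line.size() == 0) {
            return null;
        }
        byte[] raw = line.toByteArray();
        int len = raw.length;
        if (len > 0 && raw[len - 1] == '\r') {
            len--; // drop the CR of a CR LF pair
        }
        return new String(raw, 0, len, StandardCharsets.US_ASCII);
    }

    // Reads exactly contentLength bytes; read() may return fewer bytes
    // than requested, so loop until the array is full.
    static byte[] readBody(InputStream in, int contentLength) throws IOException {
        byte[] body = new byte[contentLength];
        int off = 0;
        while (off < contentLength) {
            int n = in.read(body, off, contentLength - off);
            if (n == -1) {
                throw new EOFException("stream ended after " + off + " of " + contentLength + " bytes");
            }
            off += n;
        }
        return body;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        wire.write("HTTP/1.1 200 OK\r\nContent-Length: 3\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
        wire.write("\u00e9\n".getBytes(StandardCharsets.UTF_8)); // 2 + 1 bytes
        InputStream in = new ByteArrayInputStream(wire.toByteArray());

        System.out.println(readAsciiLine(in)); // status line
        System.out.println(readAsciiLine(in)); // the only header
        readAsciiLine(in);                     // empty line ends the headers
        System.out.println(readBody(in, 3).length + " body bytes read");
    }
}
```

With a real socket, wrap socket.getInputStream() once in a BufferedInputStream and use that same stream for both the header lines and the body, so the single-byte reads stay cheap and no bytes get stranded in a second buffer.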

Given that so far you're just appending it all to a StringBuilder, i.e. you don't seem to care about being able to deal with very large input, you could just stream all data into a byte array until it's ALL received, then convert the entire byte array to a string, which completely solves the current problem you are having. It won't solve the 5000 other problems you're going to have in the near future, of course.
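As a rough sketch of that last suggestion: once the whole body is in a byte array, decode it in one go, edit the text, encode it again, and take the new Content-Length from the re-encoded array. The class and method names here are hypothetical, and the charset is assumed to be known already (in practice it would have to come from the response, for example the Content-Type header).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BodyRewriter {
    // Decodes the complete body, replaces every occurrence of target,
    // and re-encodes the result with the same charset.
    static byte[] rewrite(byte[] body, Charset charset, String target, String replacement) {
        String text = new String(body, charset);
        String altered = text.replace(target, replacement);
        return altered.getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] original = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        byte[] altered = rewrite(original, StandardCharsets.UTF_8, "caf\u00e9", "t\u00e9");

        // The new Content-Length is the length of the re-encoded bytes,
        // not the number of characters in the String.
        System.out.println("old Content-Length: " + original.length); // 12
        System.out.println("new Content-Length: " + altered.length);  // 10
    }
}
```

This also touches on the question about UTF-16: a Java String has no "UTF-8-ness" to preserve. The charset only matters at the two boundaries, when bytes are decoded into a String and when the String is encoded back into bytes.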