java, string, performance, memory-management, garbage-collection

How to process String stream effectively?


I attended an interview session and was asked about ways to optimize this pseudo-code. It runs under load, utilizing 100% CPU and 100% memory. After some discussion it was suggested that this is caused by GC.

public class MyStringProcessor {
    ....
    public String process(InputStream inputStream) {
       String input = getString(inputStream);
       int position = input.length() / 2;
       return input.substring(0, position)
            + "some_constant_string_inside" + input.substring(position);
    }
}

As we can see, we create two Strings during this call, so if this method is invoked 1000 times then 2000 Strings will be created. That is the reason for the memory consumption, which is the root cause of the never-ending GC work.

We agreed that it is a bad idea to create so many Strings.

The first step was reading a char array from the socket, so that we can modify this array by inserting some string at the beginning. (Here I still have an unanswered question: how can we create a char array with size = char_array_size_from_socket + some_constant_space_for_string_in_the_beginning? From my point of view, two arrays will be created.)

In the end we decided that it is better to have a char array buffer per thread and store it in a ThreadLocal to avoid race conditions and concurrent reads.

Do you have any other ideas? Could you please provide pseudocode for such a solution, taking into account that the string size is unknown and could differ for each string?


Solution

  • Your code doesn’t create two String objects per invocation, but four String instances, due to the use of substring. Unless you’re using javac with a target of Java 9 or newer, there will also be a StringBuilder instance under the hood, to perform the string concatenation. With the common compiler implementations, it will not make any attempt to precalculate the size, but start with the default capacity. So when we assume the substrings to be rather large, the internal capacity will be insufficient at each append call, causing a reallocation of the underlying array, including another copy operation for all but the first append call.
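    To illustrate, the concatenation in the original method, as compiled prior to JDK 9, is roughly equivalent to the following source (an approximation of the generated bytecode, not the literal compiler output):

    return new StringBuilder()                          // default capacity of 16 chars
        .append(input.substring(0, position))           // substring #1 copies the first half
        .append("some_constant_string_inside")
        .append(input.substring(position))              // substring #2 copies the second half
        .toString();                                    // final copy into the result String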

    The simplest approach to avoid this repeated copying of the same character content is to change the method to

    public String process(InputStream inputStream) {
        String input = getString(inputStream);
        int position = input.length() / 2;
        return new StringBuilder(input.length() + "some_constant_string_inside".length())
            .append(input, 0, position)
            .append("some_constant_string_inside")
            .append(input, position, input.length())
            .toString();
    }
    

    This basically does the same as your original method, so it doesn’t require any changes to the surrounding code. Compared to the string concatenation, as compiled prior to JDK 9, it pre-sizes the StringBuilder and avoids the creation of the two substrings. In the best case, the JVM’s optimizer will recognize this pattern and use the StringBuilder’s internal array directly as the new String’s backing array without a defensive copy, since it can prove that the StringBuilder won’t be accessed after the string creation.

    So this reduces the number of copying operations from five to two, or even to one when the JIT optimization takes place. This does not yet account for the copying and charset decoding operations taking place within the getString method. When using javac and JDK 9+, the new string concatenation may perform better than a manual StringBuilder, but since it would still require the substring operations, the manual StringBuilder use may still win.

    If changes to the surrounding code are allowed, we would need more information about it to know what we are comparing with and which changes are possible. E.g. for a lot of purposes, a String result isn’t actually needed, but a CharSequence would do. So if you change the return type of process to CharSequence, you can omit the final toString() call and return the StringBuilder. Then you would not need the JIT to eliminate the final copying step.

    public CharSequence process(InputStream inputStream) {
        String input = getString(inputStream);
        int position = input.length() / 2;
        return new StringBuilder(input.length() + "some_constant_string_inside".length())
            .append(input, 0, position)
            .append("some_constant_string_inside")
            .append(input, position, input.length());
    }
    

    Note that the process method is such a method where a String is not required but any CharSequence would do. So we can change the return type of getString to CharSequence and enable a wide range of optimization options, depending on the actual requirements.

    public CharSequence process(InputStream inputStream) {
        CharSequence input = getString(inputStream);
        int position = input.length() / 2;
        return new StringBuilder(input.length() + "some_constant_string_inside".length())
            .append(input, 0, position)
            .append("some_constant_string_inside")
            .append(input, position, input.length());
    }
    

    With this change, you may read into a reusable CharBuffer, which conveniently implements CharSequence already. You can also deal with a char[] array manually and wrap the result upon return.
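    As a minimal sketch, assuming UTF-8 input, an arbitrary 8 KiB initial capacity and a simple doubling strategy (none of which are part of the original code), a getString variant reading into a reusable, thread-local CharBuffer could look like this:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.io.UncheckedIOException;
    import java.nio.CharBuffer;
    import java.nio.charset.StandardCharsets;

    public class MyStringProcessor {
        // one reusable buffer per thread, as discussed in the question;
        // the initial capacity is an illustrative assumption
        private static final ThreadLocal<CharBuffer> BUFFER =
            ThreadLocal.withInitial(() -> CharBuffer.allocate(8192));

        private CharSequence getString(InputStream inputStream) {
            CharBuffer buffer = BUFFER.get();
            buffer.clear();
            Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
            try {
                while (reader.read(buffer) != -1) {
                    if (!buffer.hasRemaining()) {          // input exceeds the current capacity
                        CharBuffer bigger = CharBuffer.allocate(buffer.capacity() * 2);
                        buffer.flip();
                        bigger.put(buffer);                // copy the characters read so far
                        buffer = bigger;
                        BUFFER.set(buffer);                // keep the larger buffer for reuse
                    }
                }
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
            buffer.flip();                                 // position..limit now spans the decoded content
            return buffer;                                 // CharBuffer implements CharSequence
        }
    }

    Keep in mind that the returned CharBuffer is only valid until the next invocation on the same thread, which is the usual caveat of buffer reuse.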

    Another option for certain character encodings is to read into a byte[] array instead, skipping the charset decoding that would happen when using InputStreamReader, followed by returning a custom CharSequence which wraps the array. This works if the character encoding has a direct mapping from bytes to characters or if you have sufficient knowledge about the content, e.g. when the encoding is supposed to be UTF-8, but the specific content is known to be completely within the ASCII range.
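    For illustration, a minimal CharSequence wrapper over a byte[] could look like the following; the class name and the assumption that the content is ISO-8859-1 (or plain ASCII) are mine, not part of the original code:

    import java.nio.charset.StandardCharsets;

    final class AsciiCharSequence implements CharSequence {
        private final byte[] bytes;
        private final int offset, length;

        AsciiCharSequence(byte[] bytes, int offset, int length) {
            this.bytes = bytes;
            this.offset = offset;
            this.length = length;
        }

        @Override public int length() {
            return length;
        }

        @Override public char charAt(int index) {
            // ISO-8859-1 (and hence ASCII) maps each byte directly to one char
            return (char) (bytes[offset + index] & 0xFF);
        }

        @Override public CharSequence subSequence(int start, int end) {
            return new AsciiCharSequence(bytes, offset + start, end - start);
        }

        @Override public String toString() {
            return new String(bytes, offset, length, StandardCharsets.ISO_8859_1);
        }
    }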

    This would halve the amount of memory due to the use of byte[] instead of char[] but also eliminate the implied copying normally happening with charset decoding operations (as you need a source ByteBuffer and target CharBuffer existing at the same time).

    The most recent version of Files.readString(…) does a similar optimization. In the best case, it will return a String encapsulating the same array into which the file was read. Of course, application code can not go that far.

    In principle, you could change the getString method to return a mutable CharSequence implementation, so process can insert the constant and return that very instance. But mind that this precludes any reuse of the buffer, and further, inserting in the middle still requires copying all characters after that position, i.e. half the string, to a new position. If the capacity is not sufficient to hold the string to insert, it implies copying the entire contents into a new buffer anyway. So the effect of this optimization is not as large as you might expect.
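    For completeness, a sketch of that last variant, assuming getString is changed to return a StringBuilder (one possible mutable CharSequence implementation); the insert call still shifts the second half of the contents and may reallocate the internal array:

    public CharSequence process(InputStream inputStream) {
        StringBuilder input = getString(inputStream);  // hypothetical variant returning StringBuilder
        int position = input.length() / 2;
        // insert moves everything after position and may grow the backing array
        return input.insert(position, "some_constant_string_inside");
    }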