Search code examples
javastringperformanceapache-stringutils

Understanding StringUtils.join performance decisions


I was looking at the implementation of Apache Commons' StringUtils.join method and stumbled upon a line I assume is thought for performance but I don't understand why they did it the way it is, with those specific values.

Here's the implementation:

public static String join(Object[] array, String separator, int startIndex, int endIndex) {
    if (array == null) {
        return null;
    }
    if (separator == null) {
        separator = EMPTY;
    }

    // endIndex - startIndex > 0:   Len = NofStrings *(len(firstString) + len(separator))
    //           (Assuming that all Strings are roughly equally long)
    int noOfItems = (endIndex - startIndex);
    if (noOfItems <= 0) {
        return EMPTY;
    }

    StringBuilder buf = new StringBuilder(noOfItems * 16); // THE QUESTION'S ABOUT THIS LINE

    for (int i = startIndex; i < endIndex; i++) {
        if (i > startIndex) {
            buf.append(separator);
        }
        if (array[i] != null) {
            buf.append(array[i]);
        }
    }
    return buf.toString();
}

My questions regard the StringBuilder buf = new StringBuilder(noOfItems * 16); line:

  • I assume giving the StringBuilder an initial capacity targets performance so less resizing is needed while building the string. My question is: how much does these resizing operations actually hurt performance? Does this strategy really improve efficiency in terms of speed? (Because in term of space it could even be negative if more space than necessary is allocated)
  • Why is the magic number 16 being used? Why would they assume each String in the array would be 16 characters long? What good does this guess do?

Solution

  • 16 is a slight over-estimate (presumably based on experience/statistics) of the expected average size of the strings-with-separator.

    Pre-allocating enough space to hold the entire result avoids replacing the backing array during execution with a larger (double the size) array and copying over the elements (which is an O(n) operation).

    Over estimating, even by quite a bit, to allocate a larger array is worth the cost if it avoids the replacement operation in most situations.