Search code examples
javastringperformancememory-managementlanguage-design

Does it make sense to "waste" 8 bytes per String instance for offset/count?


Strings in Java support structural sharing for some methods like substring, which means that supposedly immutable data doesn't need to be copied (which (unexpectedly) keeps large char arrays alive which would have been GC'd otherwise.)

This feature is implemented with two fields offset and count which are set accordingly when a String is substringed in Java.

Considering that .NET doesn't do this and claims that "O(n) is O(1) if n does not grow large", would a slightly different design of Strings make sense which accommodates both requirements?

E. g. would it make sense to have a sealed, memory-efficient, general purpose version of String which doesn't have these superfluous fields and a subclass "SubString" which is only returned by substring methods and has the additional fields to avoid copying?

Rough sketch:

sealed class String {
  val codeunits: Array[Char] = ...
  def length = codeunits.length

  def substring: SubString = ...

  ...
}

final class SubString extends String {
  val offset: Int = ...
  override def length = codeunits.length - offset /* and so on */

  ...
}

Solution

  • What you suggest could make the common case more efficient in terms of memory and cpu.

    You may be interested to know the JVM can change this without a code change. The Sun/Oracle JVM currently uses a byte[] automagically when the characters fit into bytes without loss.

    In any case its the sort of thing you would want the JVM to do for you transparently, like -XX:+UseCompressedStrings does.