Tags: java, string, unicode, character-encoding, standards

How to correctly compute the length of a String in Java?


I know about String#length and the various methods in Character, which more or less operate on code units/code points.

What is the suggested way in Java to compute the length as specified by the Unicode standard (UAX #29), taking things like language/locale, normalization and grapheme clusters into account?
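
For example, with combining marks neither of these gives the count a user would perceive:

    String s = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT, rendered as "é"
    System.out.println(s.length());                      // 2 (UTF-16 code units)
    System.out.println(s.codePointCount(0, s.length())); // 2 (code points)
    // ...yet a reader sees a single character (one grapheme cluster)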


Solution

  • java.text.BreakIterator can iterate over text and report "character" (user-perceived character, i.e. grapheme cluster), word, sentence and line boundaries.

    Consider this code:

    import java.text.BreakIterator
    import java.util.Locale

    // Counts user-perceived characters (grapheme clusters) rather than code units.
    def length(text: String, locale: Locale = Locale.ENGLISH) = {
      val charIterator = BreakIterator.getCharacterInstance(locale)
      charIterator.setText(text)

      var result = 0
      while (charIterator.next() != BreakIterator.DONE) result += 1
      result
    }
    

    Running it:

    scala> val text = "Thîs lóo̰ks we̐ird!"
    text: java.lang.String = Thîs lóo̰ks we̐ird!
    
    scala> val length = length(text)
    length: Int = 17
    
    scala> val codepoints = text.codePointCount(0, text.length)
    codepoints: Int = 21 
    

    With surrogate pairs:

    scala> val parens = "\uDBFF\uDFFCsurpi\u0301se!\uDBFF\uDFFD"
    parens: java.lang.String = 􏿼surpíse!􏿽
    
    scala> val length = length(parens)
    length: Int = 10
    
    scala> val codepoints = parens.codePointCount(0, parens.length)
    codepoints: Int = 11
    
    scala> val codeunits = parens.length
    codeunits: Int = 13
    

    This should do the job in most cases.
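
    Since the question asks specifically about Java, the same approach in plain Java could look something like the sketch below (graphemeLength is just an illustrative name, not an existing API):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class GraphemeLength {

      // Counts user-perceived characters (grapheme clusters) by walking the
      // "character" boundaries reported by BreakIterator.
      static int graphemeLength(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
          count++;
        }
        return count;
      }

      public static void main(String[] args) {
        String text = "surpi\u0301se!";                             // 'i' + combining acute accent
        System.out.println(graphemeLength(text, Locale.ENGLISH));   // 8 grapheme clusters
        System.out.println(text.codePointCount(0, text.length()));  // 9 code points
        System.out.println(text.length());                          // 9 UTF-16 code units
      }
    }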