Is there a programming language with full and correct Unicode support?


Most programming languages have some support for Unicode, but all of them have more or less documented corner cases where things don't work correctly.


Examples

Java: reverse() in StringBuilder/StringBuffer works correctly. But length(), charAt(), etc. in String count 16-bit code units, so they give wrong results if a character needs more than 16 bits to encode.

C#: I didn't find a correct reverse method; Length and indexed access return wrong results.

Perl: Same problem.

PHP: Has no notion of Unicode at all; mbstring has some better-working replacements.


I wonder whether there is a programming language that has full and correct Unicode support. What compromises had to be made to achieve it?

  • More complex algorithms?
  • Higher memory consumption?
  • Slower performance?

How was it implemented internally?

  • Array of Ints, Linked Lists, etc.
  • Additional buffering

I saw that Python 3 had some pretty big changes in this area. How close is Python 3 now to a correct implementation?


Solution

  • Though this is a 10-year-old question,...

    Yes. Swift does.

    • The basic string type, String, performs all character handling at the Unicode "grapheme cluster" level. You are therefore forced to perform every text-mutating operation in a "Unicode-correct" manner, at the level of "human-perceived characters".

    • The String type is an abstract data type and does not expose its internal representation, but it has interfaces for accessing Unicode scalar values and the Unicode code units of the UTF-8, UTF-16, and UTF-32 encodings.

    • It also stores breadcrumbs to provide offset conversion between UTF-8 and UTF-16 in amortized O(1) time.

    • The Character type also provides decomposition into Unicode scalar values.

    • The Character type has multiple character classification properties that are based on Unicode semantics. For example, Character.isNewline returns true for all newline sequences defined in the Unicode standard, including LF, VT, FF, CR, CR-LF, NEL, ...

    • Though it's abstracted, Swift 5.x internally stores strings in UTF-8 encoded form by default. The UTF-8 code units can be accessed in strict O(1) time, so you can use UTF-8 based functions without sacrificing performance.

    • "Unicode" in Swift covers "all" characters defined in the Unicode standard and is not limited to the BMP.

    • String, Character, and all of their derived view types, such as UTF8View, UTF16View, and UnicodeScalarView, conform to the BidirectionalCollection protocol, so you can iterate over the components in both directions at every supported segmentation level. They all share the same index type, so indices obtained from one view can be used on another view as long as they point to correct grapheme cluster boundaries.
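
A few of the points above can be sketched in a short Swift 5 snippet. This is only an illustration under assumptions: the sample strings ("cafe" with a combining accent, one emoji) and all variable names are made up for the example, and it uses only standard library calls, with no Foundation import.

```swift
// Illustrative sketch; the sample strings are assumptions, not from the answer above.

// "é" spelled as "e" + U+0301 (combining acute accent): two scalars, one Character.
let cafe = "cafe\u{301}"

// The same text measured at each segmentation level.
let graphemes = cafe.count                  // 4 Characters (grapheme clusters)
let scalars = cafe.unicodeScalars.count     // 5 Unicode scalar values
let utf16Units = cafe.utf16.count           // 5 UTF-16 code units
let utf8Units = cafe.utf8.count             // 6 UTF-8 code units

// Reversal works on grapheme clusters, so the accent stays attached to its "e".
let reversedCafe = String(cafe.reversed())
let accentKept = reversedCafe == "\u{E9}fac" // true: == compares by canonical equivalence

// A character outside the BMP is still a single Character.
let emoji = "\u{1F600}"
let emojiCounts = (emoji.count, emoji.utf16.count, emoji.utf8.count) // (1, 2, 4)

// A Character can be decomposed into its Unicode scalar values.
let accented: Character = cafe.last!
let parts = accented.unicodeScalars.map { $0.value } // [0x65, 0x301]

// Unicode-aware classification: CR-LF is one Character and counts as a newline.
let crlf: Character = "\r\n"
let nel: Character = "\u{85}"
let newlines = crlf.isNewline && nel.isNewline // true

// An index found in the Character view can be reused in the other views.
let i = cafe.firstIndex(of: "f")!
let utf8Byte = cafe.utf8[i]                 // 102, the UTF-8 code unit for "f"
let scalar = cafe.unicodeScalars[i]         // "f"

print(graphemes, scalars, utf16Units, utf8Units) // prints "4 5 5 6"
print(accentKept, emojiCounts, parts, newlines, utf8Byte, scalar)
```

The UTF-16 count of 2 for the emoji is the number that a length based on 16-bit code units reports, which is the corner case the question describes; the Character count stays at 1.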