Most programming languages have some support for Unicode, but all of them have more or less documented corner cases where things won't work correctly.
Examples
Java: reverse() in StringBuilder/StringBuffer works correctly. But length(), charAt(), etc. in String do not if a character needs more than 16 bits to encode.
C#: I didn't find a correct reverse method; Length and indexed access return wrong results.
Perl: Same problem.
PHP: Has no notion of Unicode at all; mbstring has some better-working replacements.
I wonder if there is a programming language that has full and correct Unicode support. What compromises had to be made there to achieve such a thing?
How was it implemented internally?
I saw that Python 3 had some pretty big changes in this area. How close is Python 3 now to a correct implementation?
Though this is a 10-year-old question...
Yes. Swift does.
The basic string type String performs all character handling at the Unicode "Grapheme Cluster" level. You are therefore forced to perform every text-mutating operation in a "Unicode-correct" manner, at the level of "human-perceived characters".
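As a rough sketch of what that means in practice (the sample strings are my own choice, nothing official), counting and reversing both operate on whole grapheme clusters:

```swift
// "café" spelled with a decomposed accent: "e" followed by U+0301 COMBINING ACUTE ACCENT.
let cafe = "cafe\u{301}"
// The flag of Japan: two regional-indicator scalars that form one grapheme cluster.
let flag = "\u{1F1EF}\u{1F1F5}"

print(cafe.count)                 // 4
print(flag.count)                 // 1

// Reversal keeps the accent attached to its base letter.
let reversed = String(cafe.reversed())
print(reversed == "e\u{301}fac")  // true
```

A code-unit based reversal would instead move the combining accent onto a different letter.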
The String type is an abstract data type and does not expose its internal representation, but it has interfaces to access Unicode Scalar Values and Unicode Code Units for all of the UTF-8, UTF-16, and UTF-32 encodings.
It also stores breadcrumbs to provide offset conversion between UTF-8 and UTF-16 in amortized O(1) time.
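To illustrate the different views, here is a small sketch with a sample string of my own. As far as I know there is no dedicated UTF-32 view; mapping unicodeScalars to their numeric values is the usual way to get UTF-32 code units.

```swift
// "é" in decomposed form (U+0065 U+0301) followed by U+1F600 GRINNING FACE.
let s = "e\u{301}\u{1F600}"

print(s.count)                 // 2 grapheme clusters
print(s.unicodeScalars.count)  // 3 scalars
print(s.utf16.count)           // 4 code units (1 + 1 + 2)
print(s.utf8.count)            // 7 code units (1 + 2 + 4)

// UTF-32 code units are simply the scalar values.
let utf32 = s.unicodeScalars.map { $0.value }
print(utf32 == [0x65, 0x301, 0x1F600])  // true
```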
The Character type also provides decomposition into Unicode Scalar Values.
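For instance, a single Character built from three conjoining Hangul jamo can be taken apart again through its unicodeScalars property (the sample character is just an illustration):

```swift
// One human-perceived character made of three scalars (decomposed Hangul syllable).
let syllable: Character = "\u{1112}\u{1161}\u{11AB}"

// Collect the scalar values that make up the character.
let scalarValues = syllable.unicodeScalars.map { $0.value }

print(scalarValues.count)                          // 3
print(scalarValues == [0x1112, 0x1161, 0x11AB])    // true
```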
The Character type has multiple character classification properties that are based on Unicode semantics. For example, Character.isNewline returns true for all newline characters defined in the Unicode standard, including LF, VT, FF, CR, CR-LF, NEL, ...
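A quick way to check this is to run isNewline over the newline forms listed in the standard library documentation. Note that CR-LF is a single Character here, because it forms one grapheme cluster:

```swift
// LF, VT, FF, CR, CR-LF, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
let newlines: [Character] = [
    "\n", "\u{0B}", "\u{0C}", "\r", "\r\n", "\u{85}", "\u{2028}", "\u{2029}",
]

let allNewlines = newlines.allSatisfy { $0.isNewline }
print(allNewlines)                    // true

// An ordinary letter and a plain space are not newlines.
let letter: Character = "a"
let space: Character = " "
print(letter.isNewline)               // false
print(space.isNewline)                // false
```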
Although the representation is abstracted away, Swift 5.x internally stores strings in UTF-8 encoded form by default. That storage can be accessed in strict O(1) time, so you can use UTF-8 based functions without sacrificing performance.
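A minimal sketch of getting at the UTF-8 storage, assuming Swift 5.1 or later where the contiguous-string APIs (isContiguousUTF8, makeContiguousUTF8, withUTF8) are available. The sample text is my own:

```swift
// "naïve " followed by U+1F600 GRINNING FACE.
var text = "na\u{EF}ve \u{1F600}"

// Ensures contiguous UTF-8 storage; native Swift strings already have it.
text.makeContiguousUTF8()
print(text.isContiguousUTF8)   // true

// withUTF8 hands the closure a buffer over the UTF-8 code units.
let byteCount = text.withUTF8 { buffer in buffer.count }
let firstByte = text.withUTF8 { buffer in buffer[0] }

print(byteCount)               // 11
print(firstByte == 0x6E)       // true ("n")
```

The buffer pointer is only valid inside the closure, so anything you need from it should be copied out before returning.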
"Unicode" in Swift covers "all" characters defined in the Unicode standard and is not limited to the BMP.
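For example, a character outside the BMP is still one Character, even though it needs a surrogate pair in UTF-16 (the character below is an arbitrary pick):

```swift
// U+1D11E MUSICAL SYMBOL G CLEF lies in a supplementary plane.
let clef = "\u{1D11E}"

print(clef.count)                                  // 1
print(clef.unicodeScalars.first!.value == 0x1D11E) // true

// In UTF-16 it is encoded as a surrogate pair.
let units = Array(clef.utf16)
print(units == [0xD834, 0xDD1E])                   // true
```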
String and all of its view types, such as UTF8View, UTF16View, and UnicodeScalarView, conform to the BidirectionalCollection protocol, so you can iterate over the components in both directions at every supported segmentation level. (Character itself is not a collection, but its views, such as unicodeScalars, are.) They all share the same index type, String.Index, so an index obtained from one view can be used on another view as long as it points to a valid Grapheme Cluster boundary.
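Here is a small sketch of both points, backward iteration and index sharing, using a sample string of my own:

```swift
// U+1F600 GRINNING FACE sits between "b" and "c".
let s = "ab\u{1F600}c"

// Backward iteration works at each segmentation level.
let lastCharacter = s.reversed().first!
let scalarsBackward = s.unicodeScalars.reversed().map { $0.value }
let lastUTF16 = s.utf16.last!
print(lastCharacter == "c")                                // true
print(scalarsBackward == [0x63, 0x1F600, 0x62, 0x61])      // true
print(lastUTF16 == 0x63)                                   // true

// An index found at the Character level...
let i = s.firstIndex(of: "\u{1F600}")!

// ...can be used directly on the other views.
print(s.unicodeScalars[i].value == 0x1F600)                // true
print(s.utf16[i] == 0xD83D)                                // true (high surrogate)
print(s.utf8[i] == 0xF0)                                   // true (first UTF-8 byte)
```

The index lands on a grapheme cluster boundary, so each view returns the first element of that cluster at its own level.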