Tags: unicode, utf-8, character-encoding, utf-16, non-ascii-characters

BOCU-1 for internal encoding of strings


Some languages/platforms, such as Java, JavaScript, Windows, .NET, and KDE, use UTF-16. Others prefer UTF-8.

What is the reason that no language/platform uses BOCU-1? What is the rationale for JEP 254 and JEP 254 equivalent for Dotnet?

Is the reason that BOCU-1 is patented? Are there any technical reasons also?


Edit

My question is not about Java specifically. By JEP 254, I mean compact UTF-16 as mentioned in that proposal. My question is: since BOCU-1 is compact for almost any Unicode string, why doesn't any language/platform use it internally instead of UTF-16 or UTF-8? Such usage would improve cache performance for any string, not just ASCII or Latin-1 strings.

Such usage may also help non-Latin programming language support in formats like the Language Server Index Format (LSIF).
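
To make the size argument concrete, here is a minimal sketch that compares how many bytes a few sample strings take in UTF-8 and in UTF-16. The class name `EncodingSizes` and its helper methods are hypothetical, chosen only for illustration. BOCU-1 itself is left out because, as far as I know, the JDK's standard charsets do not include it.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (little-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "hello",                                 // ASCII
            "\u03b1\u03b2\u03b3\u03b4\u03b5",        // Greek, 5 code points
            "\u0928\u092e\u0938\u094d\u0924\u0947"   // Devanagari, 6 code points
        };
        for (String s : samples) {
            System.out.println(s + ": UTF-8=" + utf8Bytes(s)
                    + " bytes, UTF-16=" + utf16Bytes(s) + " bytes");
        }
    }
}
```

The ASCII sample is half the size in UTF-8, the Greek sample is the same size in both, and the Devanagari sample is larger in UTF-8 than in UTF-16. Neither encoding is the smaller one for every script, which is the gap BOCU-1 is meant to close.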


Solution

  • What is the reason that no language/platform uses BOCU-1?

    That question is far too broad in scope for Stack Overflow, and a concise answer is impossible.

    However, in the specific case of Java, note that someone raised the possibility of adopting BOCU-1 in an RFE (Request For Enhancement) in 2002. See JDK-4787935, "(str) Reducing the memory footprint for Strings".

    That bug was closed with a resolution of "Won't Fix" ten years later:

    "Although this is a very interesting proposal, it is highly unlikely that BOCU or any other multi-byte encoding for internal use would be adopted. Furthermore, this comes down to a space-time tradeoff with unclear long-term consequences. Given the length of time this proposal has lingered, it seems appropriate to close it as will not fix".

    What is the rationale for JEP 254...?

    There is a section of JEP 254 titled "Motivation" which explains that, and in particular it states "most String objects contain only Latin-1 characters". However, if that does not satisfy you, raise a separate question.

    Before doing so, ensure that it is on topic for Stack Overflow by reviewing "What topics can I ask about here?". Two of the people who reviewed JEP 254 (Aleksey Shipilev and Brian Goetz) respond here on SO, so you may get an authoritative answer.
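
    As a rough illustration of the idea behind JEP 254 (a byte array plus an encoding flag), here is a minimal sketch. It is not the JDK's actual implementation: the class `CompactText` and its members are hypothetical names, and it stores the UTF-16 case little-endian purely for simplicity.

```java
import java.nio.charset.StandardCharsets;

final class CompactText {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte coder;   // which representation 'value' holds
    final byte[] value; // 1 byte per char (Latin-1) or 2 bytes per char (UTF-16LE)

    CompactText(String s) {
        if (fitsLatin1(s)) {
            coder = LATIN1;
            value = s.getBytes(StandardCharsets.ISO_8859_1);
        } else {
            coder = UTF16;
            value = s.getBytes(StandardCharsets.UTF_16LE);
        }
    }

    // True if every UTF-16 code unit is in the Latin-1 range U+0000..U+00FF.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) return false;
        }
        return true;
    }

    // Length in UTF-16 code units, regardless of representation.
    int length() {
        return coder == LATIN1 ? value.length : value.length / 2;
    }

    // Constant-time access to the i-th UTF-16 code unit in either representation.
    char charAt(int i) {
        if (coder == LATIN1) {
            return (char) (value[i] & 0xFF);
        }
        int lo = value[2 * i] & 0xFF;
        int hi = value[2 * i + 1] & 0xFF;
        return (char) (lo | (hi << 8));
    }
}
```

    Both representations in this sketch are fixed-width, so indexing stays a constant-time operation. A stateful, variable-width encoding such as BOCU-1 would not offer that without extra bookkeeping.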

    What is the rationale for ... JEP 254 equivalent for Dotnet?

    Again, raise this as a separate SO question.

    Is the reason that BOCU-1 is patented?

    That question is specifically off topic here: "Legal questions, including questions about copyright or licensing, are off-topic for Stack Overflow", though Wikipedia notes "BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions".

    Are there any technical reasons also?

    A very important non-technical reason is that the HTML5 specification explicitly forbids the use of BOCU-1:

    Avoid these encodings
    
    The HTML5 specification calls out a number of encodings that you should avoid...
    
    Documents must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they... were never intended for Web content and the HTML5 specification forbids browsers from recognising them.
    

    Of course, that invites the question of why HTML5 forbids the use of BOCU-1. The only technical reason I can find is in this Mozilla documentation on HTML's <meta> element, which states:

    Authors must not use CESU-8, UTF-7, BOCU-1 and/or SCSU as cross-site scripting attacks with these encodings have been demonstrated.
    

    See this GitHub link for more details on the XSS vulnerability with BOCU-1.

    Also note that, in line with the HTML5 specification, none of the major browsers support BOCU-1.