Tags: unicode, superscript, glyph, combining-marks

Is there a Unicode 'combiner' akin to a superscript style?


Looking at how we handle superscripts (and subscripts), I see that on the one hand they are treated like a style.

i.e.

x<sup>y</sup>

becomes:

xʸ

But in Unicode we seem to have superscripts and subscripts instead as individual glyphs.

For example:

x U+207F

becomes:

xⁿ

I guess it makes sense to encode common uses this way, as it is more compressed. Is there a combiner (if that's the correct term) in Unicode that means "treat the following symbol(s) as superscripted", and if not, why not?
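
For what it's worth, Python's unicodedata module shows how Unicode itself classifies ⁿ. A quick sketch (standard library only; the exact data depends on the Unicode version your interpreter ships):

    import unicodedata

    n_sup = "\u207F"  # ⁿ SUPERSCRIPT LATIN SMALL LETTER N

    print(unicodedata.name(n_sup))           # SUPERSCRIPT LATIN SMALL LETTER N
    print(unicodedata.decomposition(n_sup))  # <super> 006E  (a pre-styled
                                             # compatibility variant of 'n')
    print(unicodedata.combining(n_sup))      # 0  (i.e. NOT a combining mark)

So ⁿ is its own pre-styled character rather than a combiner that superscripts whatever follows.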

The context is https://langdev.stackexchange.com/a/1962/285 where we are talking about representing exponentiation in a programming language.

It would be nice to have a Unicode 'value' (a combiner rather than a character?) that represents the exponentiation operation and renders as a superscript.

So that instead of writing:

 x**y

you could write:

x &xSomeValue; y

and have it render as:

xʸ

Does such a thing exist in Unicode, and if not, what is the rationale behind Unicode doing something else (such as providing superscript forms only for specific glyphs) instead?

There is an existing question whose answer covers one part of this question:

"Unicode does not support making arbitrary characters into superscripts."

It does not answer the rationale part. Also, the situation may have changed in the last three years.


Expand on rationale

It seems to me that a more rational design for Unicode would be to take one of the following choices:

  • provide super and subscript versions of all characters that could exist in that position

  • provide a "super" combiner that turns the next single symbol into a superscript version of itself.

  • treat superscripts the way ideographs are built up with Ideographic Description Sequences, e.g.

    2^(a+b) -> 2ᵃ⁺ᵇ

    where ^( and ) would be special Unicode 'combiners'.

Why has Unicode chosen (if it has) not to take one or more of these approaches?

The first option requires many symbols. The second option is super simple but could make more symbols representable than intended (e.g. a superscript smiley), so you might have to add rules about that. The third option encodes style rather than symbols.
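
To see how far the first option actually got, you can scan the character database for every codepoint whose compatibility decomposition is tagged <super>. A best-effort sketch in Python (coverage depends on the Unicode version bundled with your interpreter):

    import sys
    import unicodedata

    # Collect every codepoint whose compatibility decomposition is <super>
    # followed by exactly one base character.
    superscripts = {}
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if decomp.startswith("<super> "):
            bases = decomp.split()[1:]
            if len(bases) == 1:
                superscripts[chr(int(bases[0], 16))] = chr(cp)

    # Which ASCII letters have a dedicated superscript form, and which don't?
    letters = "abcdefghijklmnopqrstuvwxyz"
    print("covered:", "".join(c for c in letters if c in superscripts))
    print("missing:", "".join(c for c in letters if c not in superscripts))

The gaps this turns up (on older Unicode versions, famously 'q') illustrate the cost of the first option: every character needs its own pre-styled clone.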

What we currently have seems worse than all three. The Unicode designers are not stupid, so they must be prioritising something else. What, and why?


Slightly related: I cannot think of a maths symbol for exponentiation. Typically we use ^ in programming, i.e.

xʸ = x^y

An up arrow has also been suggested, but this doesn't look right to me:

x↑y

Another aside: xʸ (x^y) is how exponentiation is typically displayed on a calculator. Why is there no Unicode codepoint for this?
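
As a plain-text workaround, integer exponents can at least be rendered calculator-style with the superscript digits Unicode already has. A sketch (it only works because the digits and the minus sign happen to have superscript codepoints):

    # Map digits and minus to their existing superscript codepoints.
    SUP = str.maketrans("0123456789-",
                        "\u2070\u00b9\u00b2\u00b3\u2074\u2075"
                        "\u2076\u2077\u2078\u2079\u207b")

    def pow_text(base: str, exponent: int) -> str:
        """Render base**exponent with superscript digits."""
        return base + str(exponent).translate(SUP)

    print(pow_text("x", 23))   # x²³
    print(pow_text("2", -10))  # 2⁻¹⁰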


Solution

  • The term is combining character, as opposed to precomposed character. Such superscript combining characters don't exist because subscript/superscript is a formatting feature. Unicode is just a character set mapping characters/glyphs to numbers. It only deals with plain text and is not meant for formatting text:

    Rich Text. Also known as styled text. The result of adding information to plain text. Examples of information that can be added include font data, color, formatting information, phonetic annotations, interlinear text, and so on. The Unicode Standard does not address the representation of rich text. It is expected that systems and applications will implement proprietary forms of rich text. Some public forms of rich text are available (for example, ODA, HTML, and SGML). When everything except primary content is removed from rich text, only plain text should remain.

    https://unicode.org/glossary/#rich_text (emphasis mine)

    You can't make a letter bold or italic, or move a letter above or below the baseline, purely with Unicode code points. Therefore Unicode has no way to format math expressions either (except for very simple ones).
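
    You can see that distinction mechanically: compatibility normalization (NFKC) strips the superscript styling and leaves only the plain-text content, just as the glossary entry above says it should. A small Python sketch:

        import unicodedata

        # NFKC folds compatibility characters back to their plain base:
        print(unicodedata.normalize("NFKC", "x\u207F"))  # prints 'xn', not 'xⁿ'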

    You can find more of the rationale in the Unicode FAQ:

    Q: What is the difference between “rich text” and “plain text”?

    A: Rich text is text with all its formatting information: typeface, point size, weight, kerning, and so on. Plain text is the underlying content stream to which formatting is applied.

    One key distinction between the two is that rich text breaks the text up into runs and applies uniform formatting to each run. As such, rich text is inherently stateful. Plain text is not stateful. It should be possible to lose the first half of a block of plain text without any impact on rendering.

    Unicode, by design, only deals with plain text. It doesn't provide a generalized solution to rich text issues.

    Q: Why doesn't Unicode have a full set of superscripts and subscripts?

    A: The superscripted and subscripted characters encoded in Unicode are either compatibility characters encoded for roundtrip conversion of data from legacy standards, or are actually modifier letters used with particular meanings in technical transcriptional systems such as IPA and UPA. Those characters are not intended for general superscripting or subscripting of arbitrary text strings—for such textual effects, you should use text styles or markup in rich text, instead.

    Q: I've spotted a sign which uses superscript text for a meaningful abbreviation. Doesn't that mean that all the superscripted letters should be encoded in Unicode?

    A: No. It's common for specific formatting to be used to convey some of the semantic content—the meaning—of a text. As for italics, bold, or any other stylistic effect of this sort conveying meaning, the appropriate mechanism to use in such cases is style or markup in rich text.

    https://www.unicode.org/faq/ligature_digraph.html

    That means you must use a math rendering tool like LaTeX, MS Equation Editor, MathType, MathML, and so on. One of the simplest math renderers, if you don't like LaTeX, is AsciiMath, but typically LaTeX is the "standard".
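
    For the exponentiation example specifically, the markup needed is tiny. A minimal sketch emitting the two standard rich-text forms, HTML and MathML (the sup and msup elements are standard; the helper functions here are just illustrative):

        def exp_html(base: str, exponent: str) -> str:
            # HTML: superscripting is a style applied by the <sup> tag.
            return f"{base}<sup>{exponent}</sup>"

        def exp_mathml(base: str, exponent: str) -> str:
            # MathML: <msup> encodes the superscript layout explicitly.
            return f"<math><msup><mi>{base}</mi><mi>{exponent}</mi></msup></math>"

        print(exp_html("x", "y"))    # x<sup>y</sup>
        print(exp_mathml("x", "y"))  # <math><msup><mi>x</mi><mi>y</mi></msup></math>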