text unicode text-processing unicode-range

Usable Unicode Ranges for Custom Text Processing


I am working on a processor that splits text into blocks with markers. For example:

LOREM IPSUM SED AMED

will be parsed like:

{word:1}LOREM{/word:1}{space:2}
{word:3}IPSUM{/word:3}{space:4}
{word:5}SED{/word:5}{space:6}
{word:7}AMED{/word:7}

But I don't want to use "{word}" and similar tags, because they are ordinary strings themselves and break the processor when they get processed again. I need markers like these:

\E002\0001 LOREM \E003\0001 \E004\0002
\E002\0003 IPSUM \E003\0004 \E004\0005
\E002\0006 SED   \E003\0006 \E004\0007
\E002\0008 AMED  \E003\0008
  • The first code (\E002) is the element type number. Its last bit marks the element's close, so element type numbers increment by 2.
  • The second code (\0001) is the element index, used for stacking.
  • I picked \E002 arbitrarily for this example.
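
A minimal sketch of that numbering, assuming even type numbers open an element and the odd number right after it closes it. The names WORD, SPACE, closing, is_closing and base_type are hypothetical and only illustrate the bit trick:

```python
# Hypothetical element type numbers: even values open an element,
# the same value with the lowest bit set closes it.
WORD = 0xE002
SPACE = 0xE004  # next type number: +2


def closing(type_code: int) -> int:
    """Return the closing code for an opening type code."""
    return type_code | 1


def is_closing(type_code: int) -> bool:
    """True if the lowest bit is set, i.e. the code closes an element."""
    return bool(type_code & 1)


def base_type(type_code: int) -> int:
    """Strip the close bit to recover the element type."""
    return type_code & ~1
```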

But \0001 is also a code point in the Unicode range, which brings me back to where I started...

So which Unicode range can I use? \ff0000? Or how else can I solve this?

Thanks!


Solution

  • The Unicode Consortium thought of this. There is a range of Unicode code points that are meant to never represent a displayable character, but meta-codes instead:

    Noncharacters are code points that are permanently reserved and will never have characters assigned to them.
    ...
    Tag characters were intended to support a general scheme for the internal tagging of text streams in the absence of other mechanisms, such as markup languages. The use of tag characters for language tagging is deprecated.
    (http://www.unicode.org/versions/Unicode9.0.0/ch23.pdf)

    You should be able to use regular control characters as "private" tags, because these should never occur in proper strings. This would be the range from U+0000 to U+001F, excluding tab (U+0009), the common "returns" (U+000A and U+000D), and, for safety, U+0000 itself (some libraries do not like Null characters in the middle of strings).
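
As a rough illustration of the control-character option, the sketch below lists the candidates in that range and checks whether an input text already contains any of them. The names EXCLUDED, USABLE_CONTROLS and find_conflicts are hypothetical:

```python
# Code points in U+0000..U+001F left out as described: NUL, tab, LF, CR.
EXCLUDED = {0x0000, 0x0009, 0x000A, 0x000D}

# The remaining C0 control characters are candidates for private tags.
USABLE_CONTROLS = [chr(cp) for cp in range(0x0000, 0x0020) if cp not in EXCLUDED]


def find_conflicts(text: str) -> list[str]:
    """Return the candidate tag characters that already occur in the text."""
    candidates = set(USABLE_CONTROLS)
    return sorted({ch for ch in text if ch in candidates})
```

If find_conflicts returns a non-empty list for your input, those characters would have to be escaped or removed before you insert your own tags.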

    Non-characters
    Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are not recommended for use in open interchange of Unicode text data.

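    You can use U+FFFE and U+FFFF, which are officially defined as noncharacters. (U+FEFF is not one of them: it is ZERO WIDTH NO-BREAK SPACE, commonly used as the byte order mark.) There are several more official noncharacters defined, namely U+FDD0–U+FDEF and the last two code points of every plane, and you can be fairly sure they would not occur in regular text strings.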
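
For reference, the Unicode Standard defines 66 noncharacters in total: U+FDD0–U+FDEF plus the last two code points of each of the 17 planes. A small sketch of a test for them, with the hypothetical function name is_noncharacter:

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+xxFFFE and U+xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE
```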

    A few other ranges with predefined meanings, and therefore highly unlikely to occur in plain text strings, are:

    Specials: U+FFF0–U+FFF8
    The nine unassigned Unicode code points in the range U+FFF0..U+FFF8 are reserved for special character definitions.

    Annotation Characters: U+FFF9–U+FFFB
    An interlinear annotation consists of annotating text that is related to a sequence of annotated characters. For all regular editing and text-processing algorithms, the annotated characters are treated as part of the text stream. The annotating text is also part of the content, but for all or some text processing, it does not form part of the main text stream.

    Tag Characters: U+E0000–U+E007F
    This block encodes a set of 95 special-use tag characters to enable the spelling out of ASCII-based string tags using characters that can be strictly separated from ordinary text content characters in Unicode.
    (all quotations from the chapter as above)


    Staying within conventions, you can also use U+2028 (line separator) and/or U+2029 (paragraph separator).

    Technically, your use of U+E000–U+F8FF (the "Private Use Area") is okay-ish, because these code points can only define an unambiguous character in combination with a certain font. However, these codes may pop up if you get your plain text from a source where the font was included.

    As for how to encode this into your strings: it doesn't really matter if the numerical code immediately following your private tag marker is a valid Unicode character or not. If you see one of your own tag markers, then the value immediately following is always your own private sequence number.
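
The positional rule above can be sketched as follows. This is only one possible layout, assuming the markers U+E002, U+E003 and U+E004 from the question and a sequence number stored as a single code point; the names mark and parse are hypothetical:

```python
import re

# Hypothetical markers taken from the question's example (Private Use Area).
WORD_OPEN, WORD_CLOSE, SPACE = "\uE002", "\uE003", "\uE004"
MARKERS = WORD_OPEN + WORD_CLOSE + SPACE


def mark(text: str) -> str:
    """Replace words and whitespace runs with marker + index sequences."""
    out = []
    for index, token in enumerate(re.findall(r"\s+|\S+", text), start=1):
        # Assumption: the index fits in one code point below U+D800,
        # so it never lands in the surrogate range.
        seq = chr(index)
        if token.isspace():
            out.append(SPACE + seq)  # the whitespace itself is dropped
        else:
            out.append(WORD_OPEN + seq + token + WORD_CLOSE + seq)
    return "".join(out)


def parse(marked: str) -> list[tuple]:
    """Split a marked string into (marker, index) and ("text", str) items."""
    items = []
    i = 0
    while i < len(marked):
        ch = marked[i]
        if ch in MARKERS:
            # The code point right after a marker is always the index,
            # whatever character it happens to be.
            items.append((ch, ord(marked[i + 1])))
            i += 2
        else:
            j = i
            while j < len(marked) and marked[j] not in MARKERS:
                j += 1
            items.append(("text", marked[i:j]))
            i = j
    return items
```

Because the index is read by position, it does not matter that a value such as U+0001 is also a control character. The sketch does assume the input text itself contains none of the three markers.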

    As you see, there are lots of possibilities. I guess the most important criterion is whether you want to use other functions on these strings. If you create a string that is technically invalid Unicode (for instance, because it includes not-a-character values), some external functions may fail on it or silently remove the bad values. In such a case, you'd need to rigorously stick to a system in which you only use 'valid' code points.
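
One way to get a feel for which code points other tools may object to is to look at their general category. The sketch below uses Python's standard unicodedata module and flags unassigned or noncharacter code points (category Cn) and surrogates (Cs); the function name risky_code_points and the choice of categories are assumptions, not a rule from the standard:

```python
import unicodedata

# Categories assumed to be the most likely to be rejected or stripped:
# Cn = unassigned (includes noncharacters), Cs = surrogate.
RISKY_CATEGORIES = {"Cn", "Cs"}


def risky_code_points(text: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, category) for characters in a risky category."""
    found = []
    for ch in text:
        category = unicodedata.category(ch)
        if category in RISKY_CATEGORIES:
            found.append((f"U+{ord(ch):04X}", category))
    return found
```

Private-use code points (category Co) and control characters (Cc) are not flagged here, since they are valid in a Unicode string even if their meaning is private.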