ios arrays swift right-to-left left-to-right

Weird behaviour on joining mixed right-to-left and left-to-right language strings


Input:

tempTextArray: ▿ 3 elements

- 0 : "זה מבחן"
- 1 : "7 x 5 abc"
- 2 : "other text"

When doing a simple tempText = tempTextArray.joined(separator: " | "), the result does not place all the elements where I'd expect them. Result:

Printing description of tempText:
"זזה מבחן | 7 x 5 abc | other text"

It's my first time combining right-to-left and left-to-right text. Has anyone dealt with a similar situation before?

My app receives translations from the backend, so I don't know which elements have been translated to (in this case) Hebrew and which I will receive in my default language (English).


Solution

  • This is caused by the Unicode BIDI (Bidirectional Text) algorithm. First, I'll explain how to fix it, since it's fairly straightforward, then I'll explain what's happening in case you want more information.

    You need to add a Left-To-Right Mark (LRM, U+200E) at each place where you want to reset the text direction to LTR. In your case, that's at the start of the string and at the start of each | block:

    let ltr = "\u{200e}"
    let tempText = ltr + tempTextArray.joined(separator: "\(ltr) | ")
    // => ‎זה מבחן‎ | 7 x 5 abc‎ | other text
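
    Because the Left-To-Right Mark (U+200E) is invisible, it can be hard to tell whether the fix was applied. Here is a minimal sketch that wraps the same expression in a function and counts the marks in the result. The name joinedLeftToRight is hypothetical, used only for illustration.

```swift
// Hypothetical helper (not part of any library): joins the parts so that the
// whole string and each separator start with a Left-To-Right Mark.
let lrm = "\u{200E}" // LEFT-TO-RIGHT MARK, an invisible character

func joinedLeftToRight(_ parts: [String], separator: String) -> String {
    // Same expression as the fix above, just wrapped in a function.
    return lrm + parts.joined(separator: lrm + separator)
}

let parts = ["זה מבחן", "7 x 5 abc", "other text"]
let joined = joinedLeftToRight(parts, separator: " | ")

// The marks do not render, so count them to confirm they are present:
// one at the start of the string plus one in front of each separator.
let markCount = joined.unicodeScalars.filter { $0.value == 0x200E }.count
print(markCount) // 3
```

    If the string is later compared or searched, keep in mind that these extra scalars are part of it.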
    

    If you're going to do work with Hebrew, you absolutely want to read Cal Henderson's fantastic explanation of the algorithm: Understanding Bidirectional (BIDI) Text in Unicode.

    Now to explain what's happening. You're printing a single string whose first character is the ז in "זה מבחן," and whose last character is the final t in "text." It is not three strings separated by |; it's just one long string. When you display that string, the BIDI algorithm has to decide where all the characters go.

    The first character (ז) is a RTL character, so it decides that this is a RTL string that has some LTR text embedded. That's the opposite of what you want. You want this to be a LTR string with some RTL text embedded. So you need to start with a LTR character such as Left-To-Right Mark.
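
    The "first strong character decides" rule can be sketched roughly like this. It is a simplification, not the real algorithm: it only treats basic Latin letters, Hebrew letters, Arabic letters, and the two direction marks as strong, whereas a real implementation uses the full Unicode Bidi_Class property. The names Direction, strongDirection, and baseDirection are made up for this example.

```swift
enum Direction { case leftToRight, rightToLeft }

// Simplified assumption: only a few hard-coded ranges count as "strong".
func strongDirection(of scalar: Unicode.Scalar) -> Direction? {
    switch scalar.value {
    case 0x0041...0x005A, 0x0061...0x007A, 0x200E: // A-Z, a-z, Left-To-Right Mark
        return .leftToRight
    case 0x05D0...0x05EA, 0x0620...0x064A, 0x200F: // Hebrew letters, Arabic letters, Right-To-Left Mark
        return .rightToLeft
    default:
        return nil // spaces, digits, punctuation: no strong direction
    }
}

// The first strong character found decides the direction of the whole string.
func baseDirection(of text: String) -> Direction {
    for scalar in text.unicodeScalars {
        if let direction = strongDirection(of: scalar) {
            return direction
        }
    }
    return .leftToRight // fallback when no strong character is present
}

let original = "זה מבחן | 7 x 5 abc | other text"
let fixed = "\u{200E}" + original

print(baseDirection(of: original)) // rightToLeft
print(baseDirection(of: fixed))    // leftToRight
```

    This is why a single invisible mark at the very start is enough to flip the whole string to LTR.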

    The BIDI algorithm's job is to tell the system in which direction the next character should go. Each of the characters in זה is RTL, so that's easy: keep going left. But what about the space between זה and מבחן? Space is neutral in direction, and the last character was RTL, so the space goes to the left. But then we come to the space between מבחן and |. Space is neutral and | is neutral, so the BIDI algorithm would put the space and | to the left again. You want the space and | to be LTR, so you need to add another LTR character there.

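    7 does not have a strong direction either (strictly speaking, digits are "weak" rather than neutral, but here it likewise follows the RTL context), whereas x is LATIN SMALL LETTER X, which is LTR (not MULTIPLICATION X, which is neutral).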
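
    To make the character classes above concrete, here is a rough sketch that labels a few of the characters involved. The function roughBidiLabel is hypothetical, and the ranges are hard-coded assumptions covering only this example; everything it does not recognize is lumped together as neutral, which is not true in general.

```swift
// Hypothetical, heavily simplified stand-in for the Unicode Bidi_Class lookup.
func roughBidiLabel(_ scalar: Unicode.Scalar) -> String {
    switch scalar.value {
    case 0x0041...0x005A, 0x0061...0x007A: // A-Z, a-z
        return "strong LTR"
    case 0x05D0...0x05EA:                  // Hebrew letters
        return "strong RTL"
    case 0x0030...0x0039:                  // ASCII digits
        return "weak digit"
    default:                               // space, |, and everything else
        return "neutral (in this sketch)"
    }
}

// Space, |, 7, Latin x, the multiplication sign U+00D7, and ז (U+05D6).
let samples: [Unicode.Scalar] = [" ", "|", "7", "x", "\u{00D7}", "\u{05D6}"]
for scalar in samples {
    let code = String(scalar.value, radix: 16, uppercase: true)
    print("U+\(code): \(roughBidiLabel(scalar))")
}
```

    Only x and ז come back with a strong direction. The space, the |, and the 7 all depend on what surrounds them, which is why they end up on the RTL side.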

    The final result is that the BIDI algorithm decides this is a RTL string that begins 7 | זה מבחן and then is followed (to the left) by an embedded LTR string x 5 abc | other text. (In other words, this is a Hebrew string that happens to have some English in it, not an English string that happens to have some Hebrew.)

    I expect what's actually displayed in your question above isn't what you're seeing (because of how BIDI algorithms get applied on Stack Overflow). I expect it actually looks like this:

    [Image: Embedded LTR string in a RTL string]

    And if you read this right to left, it should make more sense now what's happening.