Tags: java, unicode, text-formatting

How to pad Strings with Unicode characters in Java


I am adding right padding to Strings to output them in a table format:

for (String[] tuple : testData) {
  System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
}

The result looks like this (random test data):

znZfmOEQ0Gb68taaNU6HY21lvo       -> Xq2aGqLedQnTSXg6wmBNDVb
frKweMCH8Kvgyk0J                 -> lHJ5r7YDV0jTL
NxtHP                            -> odvPJklwIzZZ
NX2scXjl5dxWmer                  -> wPDlKCKllVKk
x2HKsSHCqDQ                      -> RMuWLZ2vaP9sOF0yHmjVysJ
b0hryXKd6b80xAI                  -> 05MHjvTOxlxq1bvQ8RGe

This approach does not work when there are multi-byte Unicode characters:

0OZot🇨🇳ivbyG🧷hZM1FI👡wNhn6r6cC -> OKDxDV1o2NMqXH3VvE7q3uONwEcY5V
fBHRCjU4K8OCdzACmQZSn6WO         -> gvGBtUO5a4gPMKj9BKqBHFKx1iO7
cDUh🇲🇺b0cXkLWkS                -> SZX
WtP9t                            -> Q0wWOeY3W66mM5rcQQYKpG
va4d🍷u8SS                       -> KI
a71?⚖TZ💣🧜‍♀🕓ws5J              -> b8A

As you can see, the alignment is off.

My idea was to calculate the difference between the number of bytes used by the String and its length, and use that to offset the padding, something like this:

int correction = tuple[0].getBytes().length - tuple[0].length();

And then instead of padding to 32 chars, I would pad to 32 + correction. However, this didn't work either.
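
For reference, these are the numbers involved for a single flag (UTF-8 is spelled out explicitly here, since getBytes() without an argument uses the platform default charset):

String flag = "\uD83C\uDDE8\uD83C\uDDF3"; // 🇨🇳

System.out.println(flag.getBytes(java.nio.charset.StandardCharsets.UTF_8).length); // 8 bytes
System.out.println(flag.length());                                                 // 4 chars
// but the flag only takes up a single position in the rendered table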

Here is my test code (using emoji-java, but the behaviour should be reproducible with any Unicode characters outside the Basic Multilingual Plane):

import java.util.Collection;
import org.apache.commons.lang3.RandomStringUtils;
import com.vdurmont.emoji.Emoji;
import com.vdurmont.emoji.EmojiManager;

public class Test {

  public static void main(String[] args) {
    // create random test data
    String[][] testData = new String[15][2];
    for (String[] tuple : testData) {
      tuple[0] = RandomStringUtils.randomAlphanumeric(2, 32);
      tuple[1] = RandomStringUtils.randomAlphanumeric(2, 32);
    }

    // add some emojis
    Collection<Emoji> all = EmojiManager.getAll();
    for (String[] tuple : testData) {
      for (int i = 1; i < tuple[0].length(); i++) {
        if (Math.random() > 0.90) {
          Emoji emoji = all.stream().skip((int) (all.size() * Math.random())).findFirst().get();
          tuple[0] = tuple[0].substring(0, i - 1) + emoji.getUnicode() + tuple[0].substring(i + 1);
        }
      }
    }

    // output
    for (String[] tuple : testData) {
      System.out.format("%-32s -> %s\n", tuple[0], tuple[1]);
    }
  }
}

Solution

  • There are actually a few issues here, apart from the fact that some fonts display the flag wider than the other characters. I assume you want to count the Chinese flag as a single character (as it is drawn as a single element on the screen).

    The String class reports an incorrect length

    The String class works with chars, which are 16-bit UTF-16 code units. The problem is that not every Unicode code point fits in 16 bits: only code points from the Basic Multilingual Plane (BMP) fit in a single char; code points outside the BMP are encoded as a surrogate pair of two chars. String's length() method returns the number of chars, not the number of code points.

    Now String's codePointCount method may help in this case: it counts the number of code points in the given index range. So passing 0 and string.length() as the range returns the total number of code points in the string.
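
    A quick check (my own illustration, not taken from the question's code) shows the difference:

    String s = "a\uD83E\uDDF7b";                          // "a🧷b" – U+1F9F7 lies outside the BMP
    System.out.println(s.length());                       // 4: the emoji is a surrogate pair of two chars
    System.out.println(s.codePointCount(0, s.length()));  // 3: the emoji counts as one code point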

    Combining characters

    However, there's another problem. The 🇨🇳 Chinese flag, for example, consists of two Unicode code points: the Regional Indicator Symbol Letter C (🇨, U+1F1E8) and the Regional Indicator Symbol Letter N (🇳, U+1F1F3). Those two code points are combined into the flag of China. This is a problem you are not going to solve with the codePointCount method.
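
    You can see this by listing the code points of the flag (a small illustration):

    // 🇨🇳 written out as its two UTF-16 surrogate pairs
    "\uD83C\uDDE8\uD83C\uDDF3".codePoints()
            .forEach(cp -> System.out.printf("U+%X%n", cp)); // prints U+1F1E8 and U+1F1F3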

    The Regional Indicator Symbol Letters seem to be a special case: two of those characters can be combined into a national flag. I am not aware of a standard way to achieve what you want, so you may have to account for that manually.

    I've written a small method that computes the length of a string, counting each flag pair as a single character:

    // Requires java.util.regex.Pattern and java.util.regex.Matcher.
    static int length(String str) {
        // Regional Indicator Symbol Letters A (U+1F1E6) and Z (U+1F1FF),
        // written as UTF-16 surrogate pairs.
        String a = "\uD83C\uDDE6";
        String z = "\uD83C\uDDFF";

        // Two consecutive regional indicator symbols are rendered as one flag.
        Pattern p = Pattern.compile("[" + a + "-" + z + "]{2}");
        Matcher m = p.matcher(str);
        int count = 0;
        while (m.find()) {
            count++;
        }
        // Each flag is two code points but one visible element,
        // so subtract one per matched pair.
        return str.codePointCount(0, str.length()) - count;
    }
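
    With that helper, the padding from the question can be done by hand instead of relying on %-32s. A minimal sketch (the pad helper and the width of 32 are just illustrative, taken from the question):

    static String pad(String str, int width) {
        StringBuilder sb = new StringBuilder(str);
        // pad based on the number of visible elements, not on str.length()
        for (int i = length(str); i < width; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    // usage in the question's output loop:
    // System.out.format("%s -> %s%n", pad(tuple[0], 32), tuple[1]);

    Note that this still assumes every emoji occupies a single column in the output, which, as mentioned above, depends on the font and terminal.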