java string character-encoding locale

Why are String.endsWith and String.startsWith not consistent?


I have the test case below, and only the first assertion passes. Why?

@Test
public void test() {
    String i1 = "i";
    String i2 = "İ".toLowerCase();

    System.out.println((int)i1.charAt(0)); // 105
    System.out.println((int)i2.charAt(0)); // 105

    assertTrue(i2.startsWith(i1));

    assertTrue(i2.endsWith(i1));
    assertTrue(i1.endsWith(i2));
    assertTrue(i1.startsWith(i2));
}

Update after replies

What I am trying to do is use startsWith and endsWith in a case-insensitive way, so that the expression below returns true.

"ALİ".toLowerCase().endsWith("i");

I guess it is different for C# and Java.


Solution

  • This happens because lowercasing İ ("latin capital letter i with dot above") in an English locale turns it into two characters: "latin small letter i" followed by "combining dot above".

    This explains why the result starts with i but doesn't end with i (it ends with a combining diacritical mark instead).

    In a Turkish locale, lowercasing İ simply yields "latin small letter i", in accordance with Turkish casing rules, and your code would therefore work.

    Here's a test program to help figure out what's going on:

    class Test {
      public static void main(String[] args) {
        // Lowercase the argument using the JVM's default locale.
        char[] foo = args[0].toLowerCase().toCharArray();
        System.out.print("Lowercase " + args[0] + " has " + foo.length + " chars: ");
        // Print each UTF-16 code unit of the result in hex.
        for (int i = 0; i < foo.length; i++) {
          System.out.print("0x" + Integer.toHexString(foo[i]) + " ");
        }
        System.out.println();
      }
    }
    

    Here's what we get when we run it on a system configured for English:

    $ LC_ALL=en_US.utf8 java Test "İ"
    Lowercase İ has 2 chars: 0x69 0x307
    

    Here's what we get when we run it on a system configured for Turkish:

    $ LC_ALL=tr_TR.utf8 java Test "İ"
    Lowercase İ has 1 chars: 0x69
    

    This is even the specific example used by the API docs for String.toLowerCase(Locale), which is the method you can use to get the lowercase version in a specific locale, rather than the system default locale.
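
    As a minimal sketch of that last point, passing an explicit Locale makes the result independent of whatever the machine's default locale happens to be. The class name and the two helper methods below are made up for illustration; only String.toLowerCase(Locale), Locale.ROOT and Locale.forLanguageTag are standard API. The İ is written as the escape \u0130 so the example doesn't depend on the source file encoding.

```java
import java.util.Locale;

class LocaleLowerCaseDemo {
    // Hypothetical helper: lowercase with Turkish casing rules.
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Hypothetical helper: lowercase with locale-neutral rules.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String word = "AL\u0130"; // "ALİ"

        // Locale-neutral rules: İ becomes 'i' + U+0307 (combining dot above).
        System.out.println(lowerRoot(word).length());         // 4
        System.out.println(lowerRoot(word).endsWith("i"));    // false

        // Turkish rules: İ becomes a plain 'i'.
        System.out.println(lowerTurkish(word).length());      // 3
        System.out.println(lowerTurkish(word).endsWith("i")); // true
    }
}
```

    Keep in mind that the Turkish locale also lowercases a plain "I" to the dotless "ı", so this only makes sense for text that really is Turkish.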