How to parse string datetime & timezone with Arabic-Hindu digits in Java 8?

I want to parse a date-time string with a time zone offset, all written in Arabic-Hindu digits, so I wrote code like this:

    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char zeroDigit = '٠';
    Locale locale = Locale.forLanguageTag("ar");
    DateTimeFormatter pattern = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
            .withLocale(locale)
            .withDecimalStyle(DecimalStyle.of(locale).withZeroDigit(zeroDigit));
    ZonedDateTime parsedDateTime = ZonedDateTime.parse(dateTime, pattern);
    assert parsedDateTime != null;

But I received the exception:

    java.time.format.DateTimeParseException: Text '٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠' could not be parsed at index 19

I checked a lot of questions on Stack Overflow, but I still don't understand what I did wrong.

It works fine with dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+02:00" when the timezone doesn't use Arabic-Hindu digits.

Solution

Your dateTime string is malformed, probably because of a misunderstanding. It evidently tries to conform to the ISO 8601 format and fails, because ISO 8601 uses US-ASCII digits.

The classes of java.time (OffsetDateTime and ZonedDateTime, and in newer Java versions also Instant, whose parse method in Java 8 accepts only a trailing Z) would parse your string without any formatter if only the digits were correct for ISO 8601. In the vast majority of cases I would take your approach: try to parse the string as it is. Not in this case. To me it makes more sense to correct the string before parsing.

    import java.nio.CharBuffer;
    import java.time.OffsetDateTime;

    String dateTime = "٢٠٢١-١١-٠٨T٠٢:٢١:٠٨+٠٢:٠٠";
    char[] dateTimeChars = dateTime.toCharArray();
    // Replace every Unicode digit with the US-ASCII digit of the same value
    for (int index = 0; index < dateTimeChars.length; index++) {
        if (Character.isDigit(dateTimeChars[index])) {
            int digitValue = Character.getNumericValue(dateTimeChars[index]);
            dateTimeChars[index] = Character.forDigit(digitValue, 10);
        }
    }

    // The corrected text is plain ISO 8601, so no formatter is needed
    OffsetDateTime odt = OffsetDateTime.parse(CharBuffer.wrap(dateTimeChars));

    System.out.println(odt);

Output:

    2021-11-08T02:21:08+02:00
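For comparison, here is a minimal sketch showing that a string which already uses US-ASCII digits needs no formatter at all. The class name IsoParseDemo and the helper parseOffset are only for illustration; the one-argument parse methods are the standard java.time ones.

```java
import java.time.OffsetDateTime;
import java.time.ZonedDateTime;

public class IsoParseDemo {
    // Parses an ISO 8601 string with US-ASCII digits using the default ISO formatter.
    static OffsetDateTime parseOffset(String iso) {
        return OffsetDateTime.parse(iso);
    }

    public static void main(String[] args) {
        String ascii = "2021-11-08T02:21:08+02:00";
        // Both one-argument parse methods accept the string as it is.
        System.out.println(parseOffset(ascii));
        System.out.println(ZonedDateTime.parse(ascii));
    }
}
```

Since the string carries only an offset and no zone ID, OffsetDateTime is probably the more fitting type of the two.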

Edit: It would be even better, of course, if you could persuade the publisher of the string to use US-ASCII digits.

Edit: I know the Wikipedia article I link to below says:

Representations must be written in a combination of Arabic numerals and the specific computer characters (such as "-", ":", "T", "W", "Z") that are assigned specific meanings within the standard; …

This is one conceivable cause of the confusion. The article on Arabic numerals that it links to says:

Arabic numerals are the ten digits: 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9.

Edit: How I convert each digit: Character.getNumericValue() converts from a char representing a digit to an int equal to the number that the digit represents, so '٠' to 0, '٢' to 2, etc. It works for all characters that are digits (not only Arabic and ASCII ones). Character.forDigit() performs roughly the opposite conversion, but always to US-ASCII, so 0 to '0', 2 to '2', etc.
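The per-character conversion described above can be sketched in isolation like this. The class DigitConversionDemo and the helper toAsciiDigit are hypothetical names for this illustration; the digits are written as Unicode escapes only so that the example does not depend on the source file encoding.

```java
public class DigitConversionDemo {
    // Maps any Unicode decimal digit to the US-ASCII digit of the same value.
    // Characters that are not digits are returned unchanged.
    static char toAsciiDigit(char ch) {
        if (Character.isDigit(ch)) {
            return Character.forDigit(Character.getNumericValue(ch), 10);
        }
        return ch;
    }

    public static void main(String[] args) {
        System.out.println(toAsciiDigit('\u0660')); // Arabic-Indic digit zero
        System.out.println(toAsciiDigit('\u0662')); // Arabic-Indic digit two
        System.out.println(toAsciiDigit('\u096D')); // Devanagari digit seven
        System.out.println(toAsciiDigit('T'));      // not a digit, passes through
    }
}
```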

Edit: Thanks to @Holger for drawing my attention to CharBuffer in this context. A CharBuffer implements CharSequence, the type that the parse methods of java.time require, so it saves us from converting the char array back to a String.
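As a small sketch of that point, the two variants below give the same result; the first wraps the existing array, the second copies it into a new String first. The class CharBufferParseDemo and its two helper methods are made up for this example.

```java
import java.nio.CharBuffer;
import java.time.OffsetDateTime;

public class CharBufferParseDemo {
    // Wraps the array in a CharBuffer, which is a CharSequence view of the same chars.
    static OffsetDateTime parseViaBuffer(char[] chars) {
        return OffsetDateTime.parse(CharBuffer.wrap(chars));
    }

    // Copies the array into a new String before parsing.
    static OffsetDateTime parseViaString(char[] chars) {
        return OffsetDateTime.parse(new String(chars));
    }

    public static void main(String[] args) {
        char[] chars = "2021-11-08T02:21:08+02:00".toCharArray();
        System.out.println(parseViaBuffer(chars).equals(parseViaString(chars)));
    }
}
```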


Links