python, python-3.x, unicode, cpython, python-module-unicodedata

What is the difference between unicodedata.digit and unicodedata.numeric?


From the unicodedata documentation:

unicodedata.digit(chr[, default]) Returns the digit value assigned to the character chr as integer. If no such value is defined, default is returned, or, if not given, ValueError is raised.

unicodedata.numeric(chr[, default]) Returns the numeric value assigned to the character chr as float. If no such value is defined, default is returned, or, if not given, ValueError is raised.

Can anybody explain the difference between these two functions?

The implementation of both functions can be read here, but the difference is not evident to me from a quick look because I'm not familiar with the CPython implementation.

EDIT 1:

An example that shows the difference would be nice.

EDIT 2:

Some examples to complement the comments and the spectacular answer from @user2357112:

import unicodedata

print(unicodedata.digit('1'))  # DIGIT ONE, a decimal digit.
print(unicodedata.digit('١'))  # ARABIC-INDIC DIGIT ONE.
try:
    print(unicodedata.digit('¼'))  # Not a digit, so ValueError is raised.
except ValueError as exc:
    print(exc)  # "not a digit"

print(unicodedata.numeric('Ⅱ'))  # ROMAN NUMERAL TWO.
print(unicodedata.numeric('¼'))  # VULGAR FRACTION ONE QUARTER.

Solution

  • Short answer:

    If a character represents a decimal digit, such as 1, ¹ (SUPERSCRIPT ONE), ① (CIRCLED DIGIT ONE), or ١ (ARABIC-INDIC DIGIT ONE), unicodedata.digit returns the digit that character represents as an int (so 1 for all of these examples).

    If the character represents any numeric value, such as ⅐ (VULGAR FRACTION ONE SEVENTH) or any of the decimal digit examples above, unicodedata.numeric returns that character's numeric value as a float.

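    More recent digit characters like 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO) may raise a ValueError from unicodedata.digit, because newer versions of Unicode assign such characters Numeric_Type=Numeric rather than Numeric_Type=Digit (the long answer explains why).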
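As a rough illustration of the short answer, the sketch below runs both functions over the characters named in it. The helper `describe` and the `samples` list are hypothetical names introduced here; passing `None` as the default simply avoids the ValueError so that both results can be shown side by side. The last entry assumes a Python build whose Unicode database is version 7.0 or later.

```python
import unicodedata

# Characters mentioned in the short answer, written as escapes.
samples = [
    "1",           # DIGIT ONE
    "\u00b9",      # SUPERSCRIPT ONE
    "\u2460",      # CIRCLED DIGIT ONE
    "\u0661",      # ARABIC-INDIC DIGIT ONE
    "\u2150",      # VULGAR FRACTION ONE SEVENTH
    "\U0001F10C",  # DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO
]


def describe(ch):
    """Return (name, digit-or-None, numeric-or-None) for one character."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.digit(ch, None),    # None when no digit value is defined
        unicodedata.numeric(ch, None),  # None when no numeric value is defined
    )


for name, digit, numeric in map(describe, samples):
    print(f"{name}: digit={digit!r} numeric={numeric!r}")
```

The first four characters should report a digit value of 1 and a numeric value of 1.0, while the fraction and the dingbat zero should report no digit value but still have a numeric value.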


  • Long answer:

    Unicode characters all have a Numeric_Type property. This property can have 4 possible values: Numeric_Type=Decimal, Numeric_Type=Digit, Numeric_Type=Numeric, or Numeric_Type=None.

    Quoting the Unicode Standard, version 10.0.0, section 4.6:

    The Numeric_Type=Decimal property value (which is correlated with the General_Category=Nd property value) is limited to those numeric characters that are used in decimal-radix numbers and for which a full set of digits has been encoded in a contiguous range, with ascending order of Numeric_Value, and with the digit zero as the first code point in the range.

    Numeric_Type=Decimal characters are thus decimal digits fitting a few other specific technical requirements.

    Decimal digits, as defined in the Unicode Standard by these property assignments, exclude some characters, such as the CJK ideographic digits (see the first ten entries in Table 4-5), which are not encoded in a contiguous sequence. Decimal digits also exclude the compatibility subscript and superscript digits, to prevent simplistic parsers from misinterpreting their values in context. (For more information on superscript and subscripts, see Section 22.4, Superscript and Subscript Symbols.) Traditionally, the Unicode Character Database has given these sets of noncontiguous or compatibility digits the value Numeric_Type=Digit, to recognize the fact that they consist of digit values but do not necessarily meet all the criteria for Numeric_Type=Decimal. However, the distinction between Numeric_Type=Digit and the more generic Numeric_Type=Numeric has proven not to be useful in implementations. As a result, future sets of digits which may be added to the standard and which do not meet the criteria for Numeric_Type=Decimal will simply be assigned the value Numeric_Type=Numeric.

    So Numeric_Type=Digit was historically used for other digits that do not fit the technical requirements of Numeric_Type=Decimal, but the Unicode Consortium decided that the distinction wasn't useful, and digit characters not meeting the Numeric_Type=Decimal requirements have simply been assigned Numeric_Type=Numeric since Unicode 6.3.0. For example, 🄌 (DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ZERO), introduced in Unicode 7.0, has Numeric_Type=Numeric.
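The contiguity requirement quoted above can be seen from Python. This is a small sketch, not part of the original answer: it uses `unicodedata.decimal`, a third function in the same module that only accepts decimal digits, and the helper `zero_of` is a hypothetical name. It relies on the quoted rule that a block of Numeric_Type=Decimal digits starts at zero and ascends, and contrasts that with the superscript digits, which are scattered across code points.

```python
import unicodedata


def zero_of(ch):
    """Code point of the zero digit in ch's range (ch must be a decimal digit)."""
    return ord(ch) - unicodedata.decimal(ch)


# ARABIC-INDIC DIGIT THREE (U+0663): the zero sits three code points earlier,
# and the ten code points from there carry the values 0 through 9 in order.
start = zero_of("\u0663")
arabic_indic = [unicodedata.decimal(chr(start + i)) for i in range(10)]
print(hex(start), arabic_indic)

# Superscript zero to four live at U+2070, U+00B9, U+00B2, U+00B3, U+2074,
# so they are not a contiguous ascending range.
superscripts = "\u2070\u00b9\u00b2\u00b3\u2074"
super_digits = [unicodedata.digit(ch) for ch in superscripts]
super_decimals = [unicodedata.decimal(ch, None) for ch in superscripts]
print(super_digits)    # digit() accepts them
print(super_decimals)  # decimal() falls back to the default for each
```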

    Numeric_Type=Numeric is for all characters that represent numbers and don't fit in the other categories, and Numeric_Type=None is for characters that don't represent numbers (or at least, don't under normal usage).

    All characters with a non-None Numeric_Type property have a Numeric_Value property representing their numeric value. unicodedata.digit will return that value as an int for characters with Numeric_Type=Decimal or Numeric_Type=Digit, and unicodedata.numeric will return that value as a float for characters with any non-None Numeric_Type.
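The unicodedata module does not expose Numeric_Type directly, but the relationship described above suggests a way to infer it from which lookups succeed. The following is a heuristic sketch under that assumption, not an official API: `numeric_type` is a hypothetical helper, and it additionally uses `unicodedata.decimal`, which accepts only decimal digits.

```python
import unicodedata


def numeric_type(ch):
    """Guess a character's Numeric_Type from which lookups succeed."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"


for ch in ["7", "\u00b2", "\u00bc", "\U0001F10C", "x"]:
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name}: {numeric_type(ch)}")
```

The checks run from the most restrictive lookup to the least restrictive one, so a character lands in the first category whose function accepts it.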