Tags: python, numbers, currency-formatting

Parse currency into numbers in Python


I just learned from "Format numbers as currency in Python" that the Python module babel provides babel.numbers.format_currency to format numbers as currency. For instance,

from babel.numbers import format_currency

s = format_currency(123456.789, 'USD', locale='en_US')  # u'$123,456.79'
s = format_currency(123456.789, 'EUR', locale='fr_FR')  # u'123\xa0456,79\xa0\u20ac'

How about the reverse, from currency to numbers, such as $123,456,789.00 --> 123456789? babel provides babel.numbers.parse_number to parse localized numbers, but I didn't find anything like parse_currency. So, what is the ideal way to parse a localized currency string into a number?


I went through "Python: removing characters except digits from string" and tried the following:

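# Way 1 (Python 3: str.translate with a deletion table)
import string
s = '$123,456.79'
# map every character of s that is not a digit to None, i.e. delete it
nodigs = {ord(c): None for c in s if c not in string.digits}
n = s.translate(nodigs)         # 12345679, lost `.`

# Way 2
import re
n = re.sub(r"\D", "", s)        # 12345679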

Neither way preserves the decimal separator.


Next, I tried removing all non-numeric characters except for . from the string:

import re

# Way 1:
s = '$123,456.79'
n = re.sub(r"[^0-9.]", "", s)   # 123456.79 (no | in the class: it would be kept as a literal)

# Way 2:
non_decimal = re.compile(r'[^\d.]+')
s = '$123,456.79'
n = non_decimal.sub('', s)      # 123456.79

This does keep the decimal separator.


But the solutions above don't work for other locales, for instance:

from babel.numbers import format_currency
s = format_currency(123456.789, 'EUR', locale='fr_FR')  # u'123\xa0456,79\xa0\u20ac'
new_s = s.encode('utf-8') # 123 456,79 €

As you can see, the format of currency varies. What is the ideal way to parse currency into numbers in a general way?
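
To make the failure concrete, here is a small check (a sketch using only the standard library and the string literals quoted above) of what the [^\d.] pattern does to the French-formatted amount. The comma is treated as noise, so the position of the decimal mark is lost:

```python
import re

# same pattern as "Way 2" above: keep only digits and dots
non_decimal = re.compile(r'[^\d.]+')

us = '$123,456.79'
fr = '123\xa0456,79\xa0\u20ac'   # the fr_FR string quoted above

us_digits = non_decimal.sub('', us)
fr_digits = non_decimal.sub('', fr)

print(us_digits)   # 123456.79
print(fr_digits)   # 12345679 (the decimal comma was dropped)
```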


Solution

  • Using babel

    The babel documentation notes that number parsing is not fully implemented yet, but a lot of work has gone into getting currency information into the library. You can use get_currency_name() and get_currency_symbol() to get currency details, and the other get_... functions to get the general number details (decimal point, minus sign, etc.).

    Using that information, you can strip the currency details (name, symbol) and the grouping separators (e.g. , in the US) from a currency string. Then you replace the locale's minus sign and decimal mark with the ones used by the C locale (- for minus and . for the decimal point).

    This results in the following code (I added an object to keep some of the data, which may come in handy in further processing):

    import re, os
    from babel import numbers as n
    from babel.core import default_locale
    
    class AmountInfo(object):
        def __init__(self, name, symbol, value):
            self.name = name
            self.symbol = symbol
            self.value = value
    
    def parse_currency(value, cur):
        decp = n.get_decimal_symbol()
        plus = n.get_plus_sign_symbol()
        minus = n.get_minus_sign_symbol()
        group = n.get_group_symbol()
        name = n.get_currency_name(cur)
        symbol = n.get_currency_symbol(cur)
        remove = [plus, name, symbol, group]
        for token in remove:
            # remove the pieces of information that shall be obvious
            value = re.sub(re.escape(token), '', value)
        # change the minus sign to a LOCALE=C minus
        value = re.sub(re.escape(minus), '-', value)
        # and change the decimal mark to a LOCALE=C decimal point
        value = re.sub(re.escape(decp), '.', value)
        # just in case remove extraneous spaces
        value = re.sub(r'\s+', '', value)
        return AmountInfo(name, symbol, value)
    
    #cur_loc = os.environ['LC_ALL']
    cur_loc = default_locale()
    print('locale:', cur_loc)
    test = [ (n.format_currency(123456.789, 'USD', locale=cur_loc), 'USD')
           , (n.format_currency(-123456.78, 'PLN', locale=cur_loc), 'PLN')
           , (n.format_currency(123456.789, 'PLN', locale=cur_loc), 'PLN')
           , (n.format_currency(123456.789, 'IDR', locale=cur_loc), 'IDR')
           , (n.format_currency(123456.789, 'JPY', locale=cur_loc), 'JPY')
           , (n.format_currency(-123456.78, 'JPY', locale=cur_loc), 'JPY')
           , (n.format_currency(123456.789, 'CNY', locale=cur_loc), 'CNY')
           , (n.format_currency(-123456.78, 'CNY', locale=cur_loc), 'CNY')
           ]
    
    for v,c in test:
        print('As currency :', c, ':', v.encode('utf-8'))
        info = parse_currency(v, c)
        print('As value    :', c, ':', info.value)
        print('Extra info  :', info.name.encode('utf-8')
                             , info.symbol.encode('utf-8'))
    

    The output looks promising (in US locale):

    $ export LC_ALL=en_US
    $ ./cur.py
    locale: en_US
    As currency : USD : b'$123,456.79'
    As value    : USD : 123456.79
    Extra info  : b'US Dollar' b'$'
    As currency : PLN : b'-z\xc5\x82123,456.78'
    As value    : PLN : -123456.78
    Extra info  : b'Polish Zloty' b'z\xc5\x82'
    As currency : PLN : b'z\xc5\x82123,456.79'
    As value    : PLN : 123456.79
    Extra info  : b'Polish Zloty' b'z\xc5\x82'
    As currency : IDR : b'Rp123,457'
    As value    : IDR : 123457
    Extra info  : b'Indonesian Rupiah' b'Rp'
    As currency : JPY : b'\xc2\xa5123,457'
    As value    : JPY : 123457
    Extra info  : b'Japanese Yen' b'\xc2\xa5'
    As currency : JPY : b'-\xc2\xa5123,457'
    As value    : JPY : -123457
    Extra info  : b'Japanese Yen' b'\xc2\xa5'
    As currency : CNY : b'CN\xc2\xa5123,456.79'
    As value    : CNY : 123456.79
    Extra info  : b'Chinese Yuan' b'CN\xc2\xa5'
    As currency : CNY : b'-CN\xc2\xa5123,456.78'
    As value    : CNY : -123456.78
    Extra info  : b'Chinese Yuan' b'CN\xc2\xa5'
    

    And it still works in different locales (Brazil is notable for using the comma as a decimal mark):

    $ export LC_ALL=pt_BR
    $ ./cur.py 
    locale: pt_BR
    As currency : USD : b'US$123.456,79'
    As value    : USD : 123456.79
    Extra info  : b'D\xc3\xb3lar americano' b'US$'
    As currency : PLN : b'-PLN123.456,78'
    As value    : PLN : -123456.78
    Extra info  : b'Zloti polon\xc3\xaas' b'PLN'
    As currency : PLN : b'PLN123.456,79'
    As value    : PLN : 123456.79
    Extra info  : b'Zloti polon\xc3\xaas' b'PLN'
    As currency : IDR : b'IDR123.457'
    As value    : IDR : 123457
    Extra info  : b'Rupia indon\xc3\xa9sia' b'IDR'
    As currency : JPY : b'JP\xc2\xa5123.457'
    As value    : JPY : 123457
    Extra info  : b'Iene japon\xc3\xaas' b'JP\xc2\xa5'
    As currency : JPY : b'-JP\xc2\xa5123.457'
    As value    : JPY : -123457
    Extra info  : b'Iene japon\xc3\xaas' b'JP\xc2\xa5'
    As currency : CNY : b'CN\xc2\xa5123.456,79'
    As value    : CNY : 123456.79
    Extra info  : b'Yuan chin\xc3\xaas' b'CN\xc2\xa5'
    As currency : CNY : b'-CN\xc2\xa5123.456,78'
    As value    : CNY : -123456.78
    Extra info  : b'Yuan chin\xc3\xaas' b'CN\xc2\xa5'
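
The stripping and substitution steps can also be tried in isolation, without babel. The following is a minimal sketch and not part of the code above: normalize_amount is a hypothetical helper, and the symbol, grouping and decimal characters are passed in by hand (an assumption for illustration; in real code they would come from babel's get_... functions). It also converts the cleaned string to decimal.Decimal, since info.value above is still a string.

```python
import re
from decimal import Decimal

def normalize_amount(text, symbol, group, decimal_mark, minus='-'):
    """Hypothetical helper: strip currency decorations and return a Decimal."""
    # drop the currency symbol and the grouping separator first
    for token in (symbol, group):
        text = text.replace(token, '')
    # map the locale's minus sign and decimal mark to the C-locale forms
    text = text.replace(minus, '-').replace(decimal_mark, '.')
    # remove any remaining whitespace (including no-break spaces)
    text = re.sub(r'\s+', '', text)
    return Decimal(text)

# symbols copied by hand from the en_US and pt_BR outputs above
us = normalize_amount('$123,456.79', symbol='$', group=',', decimal_mark='.')
br = normalize_amount('-US$123.456,78', symbol='US$', group='.', decimal_mark=',')
print(us)  # 123456.79
print(br)  # -123456.78
```

The order matters: the grouping separator has to be removed before the decimal mark is rewritten, otherwise a pt_BR grouping dot would be indistinguishable from the new decimal point.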
    

    It is worth pointing out that babel has some encoding problems. That is because the locale files (in locale-data) use different encodings themselves. If you're working with currencies you're familiar with, that should not be a problem. But if you try unfamiliar currencies you might run into problems (I just learned that Poland uses iso-8859-2, not iso-8859-1).
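
For comparison, babel also provides babel.numbers.parse_decimal, which applies the grouping and decimal symbols of a given locale. A minimal sketch, assuming an installed Babel version that has parse_decimal and an input that contains exactly the symbol returned by get_currency_symbol for that locale (parse_amount is a hypothetical name, not a babel function):

```python
from babel.numbers import get_currency_symbol, parse_decimal

def parse_amount(text, currency, locale):
    """Hypothetical helper: strip the currency symbol, let babel parse the rest."""
    symbol = get_currency_symbol(currency, locale=locale)
    # remove the symbol and any surrounding (no-break) spaces
    cleaned = text.replace(symbol, '').strip()
    # parse_decimal returns a decimal.Decimal
    return parse_decimal(cleaned, locale=locale)

usd = parse_amount('$123,456.79', 'USD', 'en_US')
eur = parse_amount('123.456,79\xa0\u20ac', 'EUR', 'de_DE')
print(usd)
print(eur)
```

This only covers the symbol form of the currency; strings that spell out the currency name, or that use a different symbol than the locale data, would still need the token stripping shown earlier.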