Tags: python, python-3.x, pyparsing

pyparsing - Parse numbers with thousand separators


I am writing a parser and ran into a problem. To parse numbers, I have:

from pyparsing import Word, nums
n = Word(nums)

This works well for numbers without thousands separators. For example, n.parseString("1000", parseAll=True) returns (['1000'], {}).

However, it fails once I add a thousands separator: n.parseString("1,000", parseAll=True) raises pyparsing.ParseException: Expected end of text, found ',' (at char 1), (line:1, col:2).

How can I parse numbers with thousands separators? I don't want to simply ignore commas: for example, n.parseString("1,00", parseAll=True) should raise an error, because "1,00" is not a valid number.


Solution

  • A pure pyparsing approach would use Combine to wrap a series of pyparsing expressions, one for each field of the number (an optional sign, one to three leading digits, and any number of comma-separated groups of exactly three digits):

    import pyparsing as pp
    
    int_with_thousands_separators = pp.Combine(pp.Optional("-") 
                                               + pp.Word(pp.nums, max=3)
                                               + ("," + pp.Word(pp.nums, exact=3))[...])
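
To check how this expression treats the inputs from the question, here is a minimal sketch. It assumes pyparsing is installed; the `accepts` helper is a hypothetical name introduced only for this illustration, and `ZeroOrMore` is used as the long-hand equivalent of the `[...]` notation so the snippet also runs on older pyparsing releases.

```python
import pyparsing as pp

# Same grammar as above; ZeroOrMore is the long-hand form of the [...] notation.
int_with_thousands_separators = pp.Combine(
    pp.Optional("-")
    + pp.Word(pp.nums, max=3)
    + pp.ZeroOrMore("," + pp.Word(pp.nums, exact=3))
)


def accepts(expr, text):
    """Hypothetical helper: True if expr matches the whole of text."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False


for sample in ["1", "1,000", "-3,000,100", "1,00", "1000"]:
    print(sample, accepts(int_with_thousands_separators, sample))
```

Note that with parseAll=True this grammar rejects "1000" as well as "1,00": once separators are part of the grammar, a run of more than three digits without a comma no longer matches.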
    

    I've found that building up numeric expressions like this results in much slower parse times, because the separate parts are parsed independently, with multiple internal function and method calls (which are real performance killers in Python). You can replace the Combine with a single expression using Regex:

    # more efficient parsing with a Regex
    int_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*")
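
If you want to measure the difference on your own machine, a rough timing sketch could look like the following. The variable names, the sample string, and the repeat count are arbitrary choices for illustration, and the actual numbers will depend on your pyparsing version and hardware, so no timings are shown here.

```python
import timeit

import pyparsing as pp

combine_expr = pp.Combine(
    pp.Optional("-")
    + pp.Word(pp.nums, max=3)
    + pp.ZeroOrMore("," + pp.Word(pp.nums, exact=3))
)
regex_expr = pp.Regex(r"-?\d{1,3}(,\d{3})*")

sample = "-3,000,100"

# Both forms should return the same single token for a valid number.
assert combine_expr.parseString(sample, parseAll=True)[0] == sample
assert regex_expr.parseString(sample, parseAll=True)[0] == sample

# Time repeated parses of the same string with each expression.
for name, expr in [("Combine", combine_expr), ("Regex", regex_expr)]:
    seconds = timeit.timeit(
        lambda: expr.parseString(sample, parseAll=True), number=2000)
    print(f"{name}: {seconds:.4f} s for 2000 parses")
```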
    

    You could also use the code as posted by Jan, and pass that compiled regex to the Regex constructor.
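
As a sketch of passing a compiled pattern to Regex: the pattern below is only a hypothetical stand-in for one compiled elsewhere (it is the same pattern as above, written in verbose form), and `thousands_re` is a name made up for this example.

```python
import re

import pyparsing as pp

# Hypothetical stand-in for a pattern compiled elsewhere.
# re.VERBOSE allows whitespace and comments inside the pattern.
thousands_re = re.compile(
    r"""
    -?            # optional leading minus sign
    \d{1,3}       # one to three leading digits
    (?:,\d{3})*   # any number of comma-plus-three-digit groups
    """,
    re.VERBOSE,
)

# pp.Regex accepts an already compiled pattern as well as a pattern string.
int_with_thousands_separators = pp.Regex(thousands_re)
print(int_with_thousands_separators.parseString("-3,000,100", parseAll=True))
```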

    To do parse-time conversion to int, add a parse action that strips out the commas.

    # add parse action to convert to int, after stripping ','s
    int_with_thousands_separators.addParseAction(
        lambda t: int(t[0].replace(",", "")))
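
With the parse action attached, the token you get back is already an int. A small self-contained sketch (the names `result` and `value` are just for illustration):

```python
import pyparsing as pp

int_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*")
int_with_thousands_separators.addParseAction(
    lambda t: int(t[0].replace(",", "")))

result = int_with_thousands_separators.parseString("-3,000,100", parseAll=True)
value = result[0]

# The parse action has replaced the matched string with an int.
print(type(value).__name__, value)
```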
    

    I like using runTests to check out little expressions like this - it's easy to write a series of test strings, and the output shows either the parsed result or an annotated input string with the parse failure location. ("1,00" is included as an intentional error to demonstrate error output by runTests.)

    int_with_thousands_separators.runTests("""\
        1
        # invalid value
        1,00
        1,000
        -3,000,100
        """)
    
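runTests also returns a (success, results) tuple, so the same strings can be checked programmatically. The sketch below splits the valid and invalid inputs into two calls and uses failureTests=True for the second one, which makes the call succeed only if every string fails to parse; the names `valid_ok` and `invalid_ok` are made up for this example.

```python
import pyparsing as pp

int_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*")

# Strings that are expected to parse.
valid_ok, _ = int_with_thousands_separators.runTests("""\
    1
    1,000
    -3,000,100
    """, printResults=False)

# Strings that are expected to be rejected.
invalid_ok, _ = int_with_thousands_separators.runTests("""\
    1,00
    1000
    ,100
    """, failureTests=True, printResults=False)

print(valid_ok, invalid_ok)
```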

    If you want to parse real numbers, add pieces to represent the trailing decimal point and following digits.

    real_with_thousands_separators = pp.Combine(pp.Optional("-")
                                                + pp.Word(pp.nums, max=3)
                                                + ("," + pp.Word(pp.nums, exact=3))[...]
                                                + "." + pp.Word(pp.nums))
    
    # more efficient parsing with a Regex
    real_with_thousands_separators = pp.Regex(r"-?\d{1,3}(,\d{3})*\.\d+")
    
    # add parse action to convert to float, after stripping ','s
    real_with_thousands_separators.addParseAction(
        lambda t: float(t[0].replace(",", "")))
    
    real_with_thousands_separators.runTests("""\
        # invalid values
        1
        1,00
        1,000
        -3,000,100
        1.
    
        # valid values
        1.732
        -273.15
        """)