python python-3.x pyparsing

pyparsing range parsing over integer byte representation


I am wondering whether pyparsing can, in an easy way, parse and validate a range of integers that are encoded over several bytes. Here is a piece of code I use to parse the integer part and then do something with it (here, just printing it along with whatever follows):

from pyparsing import *
import struct
import re

min = 0x06A1D58C  # 111_269_260
max = 0x14B4CB1C  # 347_392_796

line_a = b'3F\x09\x21\xe4\xc0KBHDVC'

ParserElement.setDefaultWhitespaceChars("")
expr = Suppress('3F') + Regex(re.compile(r'.{4}', re.DOTALL)).setResultsName('id') + Word(
    srange('[A-Z]')).setResultsName('code')
expr.parseWithTabs()
try:
    result = expr.parseString(line_a.decode('latin-1'), parseAll=False)
    print(result.get('id').encode('latin-1'))
    id = struct.unpack('!I', result.get('id').encode('latin-1'))[0]
    code = result.get('code')
    if min <= id <= max:
        print(id, code)
except ParseException as e:
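    # Calling explain() through the class works whether it is a static method
    # (pyparsing 2.4) or an instance method (pyparsing 3.x)
    print(ParseException.explain(e))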

output:

b'\t!\xe4\xc0'
153216192 KBHDVC
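
The round trip through latin-1 in the code above is lossless because that codec maps each byte value 0-255 to the code point with the same number, which is what lets a text parser carry raw bytes. A minimal standalone check of that step, using only the standard library (the names `raw`, `as_text`, and `ident` are illustrative and not part of the original code):

```python
import struct

raw = b'\x09\x21\xe4\xc0'  # the four id bytes taken from line_a

# latin-1 maps byte N to code point N, so decoding and re-encoding is lossless
as_text = raw.decode('latin-1')
assert as_text.encode('latin-1') == raw

# '!I' means network (big-endian) byte order, unsigned 32-bit integer
ident = struct.unpack('!I', as_text.encode('latin-1'))[0]
print(ident)

# The same range check the post-processing step performs
print(0x06A1D58C <= ident <= 0x14B4CB1C)
```

Note that the first id byte, 0x09, is a tab character. That is why the parser calls parseWithTabs(): by default parseString expands tabs to spaces before parsing, which would corrupt the id.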

What I would like now is an expression that specifies the integer range itself, along with what comes after it. That way one could define several syntaxes depending on this integer.

Is this possible? Or do I have to keep it outside the parsing as post processing?


Solution

  • If you want this conversion and validation to happen as part of your expression definition, you can attach a parse-time callback, known as a parse action:

    binary_bytes = Regex(re.compile(r'.{4}', re.DOTALL))
    def unpack(tokens):
        return struct.unpack('!I', tokens[0].encode('latin-1'))[0]
    binary_bytes.addParseAction(unpack)
    

    Parse actions can take the parsed tokens and return a converted or augmented value.

    You can also implement a filter like your range check using a parse action like this:

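    def in_range(s, loc, tokens):
        # ParseException requires the input string; loc and a message
        # make the reported error point at the offending id
        if not (min <= tokens[0] <= max):
            raise ParseException(s, loc, "id out of range")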
    

    This kind of filter or validator is common enough that pyparsing provides addCondition, which takes a predicate and raises a ParseException for you when the predicate returns False:

    binary_bytes.addCondition(lambda tokens: min <= tokens[0] <= max)
    

    I reformatted and repackaged your example as follows:

    def make_range_condition(minval, maxval):
        # Return a condition that checks the first parsed token against minval..maxval
        return lambda t: minval <= t[0] <= maxval
    
    binary_bytes = Regex(re.compile(r'.{4}', re.DOTALL))
    binary_bytes.addParseAction(lambda tokens: struct.unpack('!I', tokens[0].encode('latin-1'))[0])
    binary_bytes.addCondition(make_range_condition(min, max))
    
    ParserElement.setDefaultWhitespaceChars("")
    expr = (Suppress('3F')
            + binary_bytes('id')
            + Word(srange('[A-Z]'))('code')
            )
    expr.parseWithTabs()
    
    try:
        result = expr.parseString(line_a.decode('latin-1'), parseAll=False)
        print(result.dump())
    except ParseException as e:
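        # Calling explain() through the class works whether it is a static method
        # (pyparsing 2.4) or an instance method (pyparsing 3.x)
        print(ParseException.explain(e))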
    

    dump() gives this output:

    [153216192, 'KBHDVC']
    - code: 'KBHDVC'
    - id: 153216192
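
The question also asks about selecting between several syntaxes depending on the integer. One way to sketch that is to build one range-restricted id expression per syntax and combine the alternatives with `|`. Everything named here is hypothetical: `id_in_range`, `low_form`, `mid_form`, `parse_line`, the lower range, and its digits-only payload are invented for illustration. Only the second range and the `[A-Z]` code come from the question. The camelCase method names match the rest of this thread; pyparsing 3 also offers snake_case equivalents.

```python
import re
import struct

from pyparsing import ParseException, ParserElement, Regex, Suppress, Word, srange

# Set before building any expression so no byte value is skipped as whitespace
ParserElement.setDefaultWhitespaceChars("")


def id_in_range(minval, maxval):
    """Build a 4-byte big-endian id expression that only matches minval..maxval."""
    four_bytes = Regex(re.compile(r'.{4}', re.DOTALL))
    four_bytes.addParseAction(
        lambda t: struct.unpack('!I', t[0].encode('latin-1'))[0])
    four_bytes.addCondition(lambda t: minval <= t[0] <= maxval)
    return four_bytes


# Hypothetical layouts: ids below the question's range are followed by digits,
# ids inside the question's range are followed by uppercase letters.
low_form = id_in_range(0x00000000, 0x06A1D58B)('id') + Word(srange('[0-9]'))('number')
mid_form = id_in_range(0x06A1D58C, 0x14B4CB1C)('id') + Word(srange('[A-Z]'))('code')

# The | operator tries each alternative in order; a failed condition simply
# makes that alternative fail, so the next one is tried.
expr = Suppress('3F') + (low_form | mid_form)
expr.parseWithTabs()


def parse_line(raw):
    """Parse raw bytes and return the named results as a dict."""
    return expr.parseString(raw.decode('latin-1')).asDict()


print(parse_line(b'3F\x09\x21\xe4\xc0KBHDVC'))
print(parse_line(b'3F' + struct.pack('!I', 1000) + b'12345'))

try:
    parse_line(b'3F\xff\xff\xff\xffABC')  # id above both ranges
except ParseException as exc:
    print("no layout matched:", exc)
```

Each call to `id_in_range` returns a fresh expression, so the conditions of different alternatives do not accumulate on a shared object. The range check stays inside the grammar, and an id outside every range surfaces as an ordinary ParseException.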