
pyparsing a field that may or may not contain values


I have a dataset that resembles the following:

Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:

The problem I am having is that I can't figure out how to properly write a capture for the "Capture MICR - Serial" field. This field could either be blank or contain an alphanumeric value of varying length. (I have the same problem with the other fields, which could also be either populated or blank.)

I have tried some variations of the following, but am still coming up short.

pp.Literal("Capture MICR - Serial:") + pp.White(" ", min=1, max=0) + (pp.Word(pp.printables) ^ pp.White(" ", min=1, max=0))("crd_micr_serial") + pp.FollowedBy(pp.Literal("Pos44:"))

I think part of the problem is that Or (the ^ operator) picks the longest match, which in this case could be a long run of whitespace rather than a single alphanumeric character, but I would still want to capture that single value.

Thanks for everyone's help.


Solution

  • The simplest way to parse text like "A: valueA B: valueB C: valueC" is to use pyparsing's SkipTo class:

    from pyparsing import LineEnd, SkipTo

    a_expr = "A:" + SkipTo("B:")
    b_expr = "B:" + SkipTo("C:")
    c_expr = "C:" + SkipTo(LineEnd())
    line_parser = a_expr + b_expr + c_expr
    

    I'd like to enhance this just a bit more:

    • add a parse action to strip off leading and trailing whitespace

    • add a results name to make it easy to get the results after the line has been parsed

    Here is how that simple parser looks:

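    from pyparsing import LineEnd, SkipTo

    NL = LineEnd()
    # each parse action strips surrounding whitespace from the skipped text;
    # the trailing ('A'), ('B'), ('C') calls attach results names
    a_expr = "A:" + SkipTo("B:").addParseAction(lambda t: [t[0].strip()])('A')
    b_expr = "B:" + SkipTo("C:").addParseAction(lambda t: [t[0].strip()])('B')
    c_expr = "C:" + SkipTo(NL).addParseAction(lambda t: [t[0].strip()])('C')
    line_parser = a_expr + b_expr + c_expr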
    
    line_parser.runTests("""
        A: 100 B: Fred C:
        A:  B: a value with spaces C: 42
    """)
    

    Gives:

     A: 100 B: Fred C:
    ['A:', '100', 'B:', 'Fred', 'C:', '']
    - A: '100'
    - B: 'Fred'
    - C: ''
    
    
    A:  B: a value with spaces C: 42
    ['A:', '', 'B:', 'a value with spaces', 'C:', '42']
    - A: ''
    - B: 'a value with spaces'
    - C: '42'
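
Outside of runTests, the results names can be read back directly from the parsed result. The following is only a minimal sketch of that, assuming the same A/B/C layout as above; the helper name `strip_value` is made up for this illustration and is not part of pyparsing.

```python
from pyparsing import LineEnd, SkipTo

def strip_value(tokens):
    # hypothetical helper: SkipTo yields the skipped text as a single string,
    # so trim the whitespace around it
    return [tokens[0].strip()]

a_expr = "A:" + SkipTo("B:").addParseAction(strip_value)("A")
b_expr = "B:" + SkipTo("C:").addParseAction(strip_value)("B")
c_expr = "C:" + SkipTo(LineEnd()).addParseAction(strip_value)("C")
line_parser = a_expr + b_expr + c_expr

result = line_parser.parseString("A:  B: a value with spaces C: 42")

# individual fields by results name; a blank field comes back as ''
blank_a = result["A"]
value_b = result["B"]

# or all named fields at once as a plain dict
fields = result.asDict()
print(fields)
```

A blank field shows up as an empty string rather than a missing key, so downstream code can treat every line the same way.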
    

    I try to avoid copy/paste code when I can, and would rather automate the "A is followed by B" and "C is followed by end-of-line" with a list describing the different prompt strings, and then walking that list to build each sub expression:

    import pyparsing as pp
    
    def make_prompt_expr(s):
        '''Define the expression for prompts as 'ABC:' '''
        return pp.Combine(pp.Literal(s) + ':')
    
    def make_field_value_expr(next_expr):
        '''Define the expression for the field value as SkipTo(what comes next)'''
        return pp.SkipTo(next_expr).addParseAction(lambda t: [t[0].strip()])
    
    def make_name(s):
        '''Convert prompt string to identifier form for results names'''
        return ''.join(s.split()).replace('-','_')
    
    # use split to easily define list of prompts in order - makes it easy to update later if new prompts are added
    prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')
    
    # keep a list of all the prompt-value expressions
    exprs = []
    
    # get a list of (this-prompt, next-prompt) pairs; the last prompt is paired with None
    for this_, next_ in zip(prompts, prompts[1:] + [None]):
        field_name = make_name(this_)
        if next_ is not None:
            next_expr = make_prompt_expr(next_)
        else:
            next_expr = pp.LineEnd()
    
        # define the prompt-value expression for the current prompt string and add to exprs
        this_expr = make_prompt_expr(this_) + make_field_value_expr(next_expr)(field_name)
        exprs.append(this_expr)
    
    # define a line parser as the And of all of the generated exprs
    line_parser = pp.And(exprs)
    
    line_parser.runTests("""\
    Capture MICR - Serial:                  Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:
    Capture MICR - Serial:  1729XYZ                Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4: XXL         Split: 50
    """)
    

    Gives:

    Capture MICR - Serial:                  Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4:          Split:
    ['Capture MICR - Serial:', '', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', '', 'Split:', '']
    - Acct: ''
    - CaptureMICR_Serial: ''
    - Opt4: ''
    - Pos44: ''
    - Split: ''
    - Tc: '2064'
    - Trrt: '32904'
    
    
    Capture MICR - Serial:  1729XYZ                Pos44:  Trrt: 32904  Acct:        Tc:   2064        Opt4: XXL         Split: 50
    ['Capture MICR - Serial:', '1729XYZ', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', 'XXL', 'Split:', '50']
    - Acct: ''
    - CaptureMICR_Serial: '1729XYZ'
    - Opt4: 'XXL'
    - Pos44: ''
    - Split: '50'
    - Tc: '2064'
    - Trrt: '32904'
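
To use this on real data rather than in runTests, the same loop can be wrapped in a function and each line turned into a dict. This is a rough sketch under a few assumptions: `build_line_parser` is a hypothetical name, the prompt is built as a single Literal instead of Combine purely for brevity, and the sample line is the one from the question.

```python
import pyparsing as pp

def build_line_parser(prompts):
    """Hypothetical helper: chain 'prompt: value' expressions, where each
    value runs up to the next prompt (or end of line for the last one)."""
    exprs = []
    for this_, next_ in zip(prompts, prompts[1:] + [None]):
        if next_ is not None:
            next_expr = pp.Literal(next_ + ":")
        else:
            next_expr = pp.LineEnd()
        # value = everything skipped before the next prompt, whitespace-trimmed
        value = pp.SkipTo(next_expr).addParseAction(lambda t: [t[0].strip()])
        # identifier-style results name, e.g. 'CaptureMICR_Serial'
        name = "".join(this_.split()).replace("-", "_")
        exprs.append(pp.Literal(this_ + ":") + value(name))
    return pp.And(exprs)

prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split("/")
line_parser = build_line_parser(prompts)

line = "Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:"
record = line_parser.parseString(line).asDict()

# blank fields are empty strings, populated ones hold the trimmed value
print(record["Trrt"], record["Tc"], repr(record["CaptureMICR_Serial"]))
```

One caveat with this approach: because each value is defined as "everything up to the next prompt", a field value that itself contains the text of the next prompt would be cut short.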