I have a dataset that resembles the following:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
The problem I am having is that I can't figure out how to properly write a capture for the "Capture MICR - Serial" field. This field could either be blank or contain an alphanumeric value of varying length (I have the same problem with the other fields, which could also be either populated or blank).
I have tried some variations of the following, but am still coming up short.
pp.Literal("Capture MICR - Serial:") + pp.White(" ", min=1, max=0) + (pp.Word(pp.printables) ^ pp.White(" ", min=1, max=0))("crd_micr_serial") + pp.FollowedBy(pp.Literal("Pos44:"))
I think that part of the problem is that the Or matches the longest alternative, which in this case could be a long run of whitespace even when the field holds only a single alphanumeric character, but I would still want to capture that single value.
Thanks for everyone's help.
The simplest way to parse text like "A: valueA B: valueB C: valueC" is to use pyparsing's SkipTo class:
from pyparsing import LineEnd, SkipTo
a_expr = "A:" + SkipTo("B:")
b_expr = "B:" + SkipTo("C:")
c_expr = "C:" + SkipTo(LineEnd())
line_parser = a_expr + b_expr + c_expr
I'd like to enhance this just a bit more:
add a parse action to strip off leading and trailing whitespace
add a results name to make it easy to get the results after the line has been parsed
Here is how that simple parser looks:
NL = LineEnd()
a_expr = "A:" + SkipTo("B:").addParseAction(lambda t: [t[0].strip()])('A')
b_expr = "B:" + SkipTo("C:").addParseAction(lambda t: [t[0].strip()])('B')
c_expr = "C:" + SkipTo(NL).addParseAction(lambda t: [t[0].strip()])('C')
line_parser = a_expr + b_expr + c_expr
line_parser.runTests("""
A: 100 B: Fred C:
A: B: a value with spaces C: 42
""")
Gives:
A: 100 B: Fred C:
['A:', '100', 'B:', 'Fred', 'C:', '']
- A: '100'
- B: 'Fred'
- C: ''
A: B: a value with spaces C: 42
['A:', '', 'B:', 'a value with spaces', 'C:', '42']
- A: ''
- B: 'a value with spaces'
- C: '42'
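With the results names in place, individual fields can be read off the parsed result by name rather than by position. Here is a minimal sketch that reuses the same A/B/C parser; the names `strip_ws` and `result` are only for illustration, and it uses the same camelCase method names as the rest of this answer:

```python
from pyparsing import LineEnd, SkipTo

def strip_ws(t):
    # parse action shared by all fields: strip surrounding whitespace
    return [t[0].strip()]

a_expr = "A:" + SkipTo("B:").addParseAction(strip_ws)("A")
b_expr = "B:" + SkipTo("C:").addParseAction(strip_ws)("B")
c_expr = "C:" + SkipTo(LineEnd()).addParseAction(strip_ws)("C")
line_parser = a_expr + b_expr + c_expr

result = line_parser.parseString("A: 100 B: Fred C:")

# named fields can be read as attributes or with dict-style keys
print(result.A)
print(result["B"])
print(repr(result["C"]))  # a blank field comes back as an empty string
```

A blank field is still present in the results under its name, so there is no need to special-case missing keys.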
I try to avoid copy/paste code when I can, so I would rather automate the "A is followed by B" and "C is followed by end-of-line" definitions: keep a list of the prompt strings in order, then walk that list to build each subexpression:
import pyparsing as pp

def make_prompt_expr(s):
    '''Define the expression for prompts as 'ABC:' '''
    return pp.Combine(pp.Literal(s) + ':')

def make_field_value_expr(next_expr):
    '''Define the expression for the field value as SkipTo(what comes next)'''
    return pp.SkipTo(next_expr).addParseAction(lambda t: [t[0].strip()])

def make_name(s):
    '''Convert prompt string to identifier form for results names'''
    return ''.join(s.split()).replace('-', '_')

# use split to easily define list of prompts in order - makes it easy to update later if new prompts are added
prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')
# keep a list of all the prompt-value expressions
exprs = []
# get a list of this-prompt, next-prompt pairs
for this_, next_ in zip(prompts, prompts[1:] + [None]):
    field_name = make_name(this_)
    if next_ is not None:
        next_expr = make_prompt_expr(next_)
    else:
        next_expr = pp.LineEnd()
    # define the prompt-value expression for the current prompt string and add to exprs
    this_expr = make_prompt_expr(this_) + make_field_value_expr(next_expr)(field_name)
    exprs.append(this_expr)
# define a line parser as the And of all of the generated exprs
line_parser = pp.And(exprs)
line_parser.runTests("""\
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
""")
Gives:
Capture MICR - Serial: Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: Split:
['Capture MICR - Serial:', '', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', '', 'Split:', '']
- Acct: ''
- CaptureMICR_Serial: ''
- Opt4: ''
- Pos44: ''
- Split: ''
- Tc: '2064'
- Trrt: '32904'
Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50
['Capture MICR - Serial:', '1729XYZ', 'Pos44:', '', 'Trrt:', '32904', 'Acct:', '', 'Tc:', '2064', 'Opt4:', 'XXL', 'Split:', '50']
- Acct: ''
- CaptureMICR_Serial: '1729XYZ'
- Opt4: 'XXL'
- Pos44: ''
- Split: '50'
- Tc: '2064'
- Trrt: '32904'
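Outside of runTests, the same parser can be applied one line at a time and the named fields pulled out as a plain dict. The sketch below rebuilds the parser from the prompt list so that it runs on its own. It is only one way to do it: `parse_line` and `strip_ws` are hypothetical helper names, the prompts are built with `pp.Literal(s + ':')` as a shorter stand-in for the Combine expression above, and returning `None` for a line that does not match is an assumption about how you might want to treat bad input.

```python
import pyparsing as pp

prompts = "Capture MICR - Serial/Pos44/Trrt/Acct/Tc/Opt4/Split".split('/')

def strip_ws(t):
    # parse action: strip leading and trailing whitespace from the value
    return [t[0].strip()]

exprs = []
for this_, next_ in zip(prompts, prompts[1:] + [None]):
    # a value runs up to the next prompt, or to end of line for the last prompt
    next_expr = pp.Literal(next_ + ':') if next_ is not None else pp.LineEnd()
    name = ''.join(this_.split()).replace('-', '_')
    value = pp.SkipTo(next_expr).addParseAction(strip_ws)(name)
    exprs.append(pp.Literal(this_ + ':') + value)

line_parser = pp.And(exprs)

def parse_line(line):
    '''Return the named fields of one line as a dict, or None if it does not match.'''
    try:
        return line_parser.parseString(line).asDict()
    except pp.ParseException:
        return None

record = parse_line(
    "Capture MICR - Serial: 1729XYZ Pos44: Trrt: 32904 Acct: Tc: 2064 Opt4: XXL Split: 50"
)
print(record["CaptureMICR_Serial"], record["Trrt"], record["Split"])
```

Blank fields show up in the dict as empty strings, so a later step can decide whether to keep them or convert them to `None`.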