Why is this pyparsing grammar not respecting line endings?


I'm writing a parser for a file format, and have an example I've reduced to the following:

import pyparsing as pp

element = pp.OneOrMore(pp.Word(pp.alphas)) | pp.Literal("|")
line = pp.Group(pp.OneOrMore(element)) + pp.White("\n")
top_level = pp.OneOrMore(line)

f = """
sdf dfg sdfgsdfsd | dsfgsdfsd sd sddffds safd | dfgdfg sadf | 
dsfg gdfg asdsad | gdfgdf dfgdfgdf sdf | dfgdfgdf |
"""

parse_result = top_level.parseString(f)
print(parse_result.dump())

This gives:

[['sdf', 'dfg', 'sdfgsdfsd', '|', 'dsfgsdfsd', 'sd', 'sddffds', 'safd', '|', 'dfgdfg', 'sadf', '|', 'dsfg', 'gdfg', 'asdsad', '|', 'gdfgdf', 'dfgdfgdf', 'sdf', '|', 'dfgdfgdf', '|'], '\n']
[0]:
  ['sdf', 'dfg', 'sdfgsdfsd', '|', 'dsfgsdfsd', 'sd', 'sddffds', 'safd', '|', 'dfgdfg', 'sadf', '|', 'dsfg', 'gdfg', 'asdsad', '|', 'gdfgdf', 'dfgdfgdf', 'sdf', '|', 'dfgdfgdf', '|']
[1]:

What I want is for each line of text to appear as a separate Group(), and it's not clear to me why pp.White("\n") isn't matching at the end of the first line (I have also tried LineEnd(), with the same result).


Solution

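  • By default pyparsing treats newlines as skippable whitespace, so OneOrMore(element) runs straight across the line break and pp.White("\n") only gets a chance to match the final newline. You really need just one more line: call ParserElement.setDefaultWhitespaceChars(' \t') before building the grammar, so that only spaces and tabs are skipped. I also 'swallow' the newlines with suppress(), like this: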

    >>> import pyparsing as pp
    >>> pp.ParserElement.setDefaultWhitespaceChars(' \t')
    >>> element = pp.OneOrMore(pp.Word(pp.alphas)) | pp.Literal("|")
    >>> line = pp.Group(pp.OneOrMore(element)) + pp.White("\n").suppress()
    >>> top_level = pp.OneOrMore(line)
    >>> f = '''\
    ... sdf dfg sdfgsdfsd | dsfgsdfsd sd sddffds safd | dfgdfg sadf | 
    ... dsfg gdfg asdsad | gdfgdf dfgdfgdf sdf | dfgdfgdf |
    ... '''
    
    >>> r = top_level.parseString(f)
    >>> for item in r.asList():
    ...     item
    ... 
    ['sdf', 'dfg', 'sdfgsdfsd', '|', 'dsfgsdfsd', 'sd', 'sddffds', 'safd', '|', 'dfgdfg', 'sadf', '|']
    ['dsfg', 'gdfg', 'asdsad', '|', 'gdfgdf', 'dfgdfgdf', 'sdf', '|', 'dfgdfgdf', '|']
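
To see the effect of that one line in isolation, here is a minimal sketch that builds the same grammar twice, once under the library defaults and once after restricting the default whitespace to spaces and tabs. The helper name build_grammar and the short sample string are made up for illustration; everything else is taken from the code above.

```python
import pyparsing as pp


def build_grammar():
    # Same grammar as in the answer. Each element picks up whichever
    # default whitespace characters are in force when it is created,
    # which is why the grammar is rebuilt after changing the setting.
    element = pp.OneOrMore(pp.Word(pp.alphas)) | pp.Literal("|")
    line = pp.Group(pp.OneOrMore(element)) + pp.White("\n").suppress()
    return pp.OneOrMore(line)


# Made-up two-line input, shorter than the one in the question.
sample = "ab cd | ef |\ngh | ij kl |\n"

# Library defaults: newline is skipped like any other whitespace,
# so the first group keeps consuming tokens across the line break.
before = build_grammar().parseString(sample).asList()

# Only spaces and tabs are skipped from here on; rebuild and re-parse.
pp.ParserElement.setDefaultWhitespaceChars(" \t")
after = build_grammar().parseString(sample).asList()

print(len(before), "group(s) with the defaults")
print(len(after), "group(s) after the change")
for row in after:
    print(row)
```

With the defaults, everything should land in a single group; after the change, each input line should get its own. Note that setDefaultWhitespaceChars changes a class-level default, so it affects every grammar created afterwards in the same process, not just this one.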