python pyparsing

PyParsing ignores newline?


I want to parse a git log file that looks like this:

d2436fa AuthorName 2015-05-15 Commit Message
4    3    README.md

The output I'm expecting looks like this:

[ ['d2436fa', 'AuthorName', '2015-05-15', 'Commit Message'],
[4, 3, 'README.md'] ]

My grammar to parse this is:

from pyparsing import (Combine, LineEnd, LineStart, OneOrMore, Regex, Word,
                       alphanums, alphas, alphas8bit, nums, printables)

hsh = Word(alphanums, exact=7)
author = OneOrMore(Word(alphas + alphas8bit + '.'))
date = Regex(r'\d{4}-\d{2}-\d{2}')  # raw string keeps the \d escapes intact
message = OneOrMore(Word(printables + alphas8bit))
count = Word(nums)
file = Word(printables)
blankline = LineStart() + LineEnd()

commit = hsh + Combine(author, joinString=' ', adjacent=False) + \
         date + Combine(message, joinString=' ', adjacent=False) + LineEnd()
changes = count + count + file + LineEnd()
check = commit ^ changes ^ blankline

The output I actually get is:

['d2436fa', 'AuthorName', '2015-05-15', 'Commit Message 4 3 README.md']

Why is the newline ignored? I thought that was what LineEnd() is for. When I split the input on '\n' first, everything works fine :/


Solution

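  • pyparsing has a (controversial?) rule about whitespace in grammars:

    During the matching process, whitespace between tokens is skipped by default (although this can be changed)

    Newlines count as whitespace under that default, so the parser skips the '\n' after "Commit Message" and message keeps consuming words from the next line; LineEnd() then only matches at the very end of the input. As the quote says, the default can be changed. You can tell pyparsing which characters to treat as whitespace with something like the following: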

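    from pyparsing import ParserElement

    i_consider_whitespaces_to_be_only = ' '
    ParserElement.setDefaultWhitespaceChars(i_consider_whitespaces_to_be_only)

    (This tells pyparsing to skip only spaces, not newlines; you could also include other characters, e.g. tabs. Call it before you define the grammar, since it only affects elements created afterwards.)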
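To see the whitespace change applied to the grammar from the question, here is a minimal sketch rather than a definitive fix. It assumes spaces and tabs should still be skipped, and it adds a few things that are not in the original post: Group to nest each line's tokens, Suppress to drop the newline token, a parse action that converts the counts to int, and the names eol, filename, log and sample. Blank-line handling is left out.

```python
from pyparsing import (
    Combine, Group, LineEnd, OneOrMore, ParserElement, Regex, Suppress, Word,
    alphanums, alphas, alphas8bit, nums, printables,
)

# Must run before any grammar element is created: only spaces and tabs
# are skipped between tokens from here on, so '\n' becomes significant.
ParserElement.setDefaultWhitespaceChars(" \t")

hsh = Word(alphanums, exact=7)
author = OneOrMore(Word(alphas + alphas8bit + "."))
date = Regex(r"\d{4}-\d{2}-\d{2}")
message = OneOrMore(Word(printables + alphas8bit))
# Assumption: counts are wanted as integers, as in the expected output.
count = Word(nums).setParseAction(lambda t: int(t[0]))
filename = Word(printables)
eol = Suppress(LineEnd())  # match the newline but keep it out of the results

commit = Group(
    hsh
    + Combine(author, joinString=" ", adjacent=False)
    + date
    + Combine(message, joinString=" ", adjacent=False)
    + eol
)
changes = Group(count + count + filename + eol)
log = OneOrMore(commit | changes)

sample = "d2436fa AuthorName 2015-05-15 Commit Message\n4    3    README.md\n"
result = log.parseString(sample).asList()
print(result)
```

Because message can no longer step over the newline, it stops at the end of the commit line and the second line is left for changes to match. Note that setDefaultWhitespaceChars changes a process-wide default, so any other pyparsing grammar built later in the same program inherits it.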