
How to get rid of trailing whitespace


I'm trying to finish an ISC-style (Bind9/DHCP) configuration parser in pyparsing (after searching GitHub, Google, et al. for a long time).

An ISC-style configuration file has the following quirky text attributes:

  • All C/C++/Bash comment styles
  • include file support
  • a semicolon terminates each statement, before the next keyword
  • the semicolon may or may not directly follow the preceding token
  • multi-line support (semicolon may be several lines later)

The closest existing match to ISC-style config syntax (also written in pyparsing) is an NGINX parser that I saw on GitHub. But adopting it would mean giving up pyparsing's automatic whitespace handling, which I would like to keep if possible.

My existing pyparsing grammar turned out to be on shaky ground once I started input-fuzz unit testing. The parsed results keep the trailing whitespace:

[['server', 'example.com']]
[['server', 'example.com ']]
[['server', 'example.com      ']]
[['server', 'example.com']]
[['server', 'example.com ']]
[['server', 'example.com     ']]
[['server', 'example.com    ']]
[['server', 'example.com      ']]
[['server', 'example.com                     ']]
['options', ['server', 'example.com     '], ['server2', 'example2.net   ']]

Here is the relevant snippet of the grammar:

lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
period = Literal(".")
semicolon = Literal(";").suppress()

domain_name = Word(srange("[0-9A-Za-z]"), min=1, max=63)
domain_name.setName("domain")
fqdn = originalTextFor(domain_name - \
                       originalTextFor(period - \
                                       domain_name) * (0, 16) - \
                       Optional(period))
fqdn.setName("fully-qualified domain name")
orig_fqdn = originalTextFor(fqdn).setName('FQDN')
options_server = Group(Keyword("server") - fqdn - semicolon)
options_server2 = Group(Keyword("server2") - fqdn - semicolon)
options_group = Optional(options_server) & \
                      Optional(options_server2)

I'm still not able to get rid of the trailing whitespace.

I tried the following, to no avail:

iwsp = Optional(Word("[ \t]")).suppress() # Ignore WhiteSPace
options_server = Group(Keyword("server") - fqdn - iwsp - semicolon)

What am I doing wrong?

A complete, runnable Python script is enclosed below:

#!/usr/bin/env python3

from pyparsing import Literal, Word, srange, \
    originalTextFor, Optional, ParseException, \
    OneOrMore, Keyword, ZeroOrMore, \
    ParseSyntaxException, Group

lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
period = Literal(".")
semicolon = Literal(";").suppress()

domain_name = Word(srange("[0-9A-Za-z]"), min=1, max=63)
domain_name.setName("domain")
fqdn = originalTextFor(domain_name - \
                       originalTextFor(period - \
                                       domain_name) * (0, 16) - \
                       Optional(period))
fqdn.setName("fully-qualified domain name")
orig_fqdn = originalTextFor(fqdn).setName('FQDN')
options_server = Group(Keyword("server") - fqdn - semicolon)
options_server2 = Group(Keyword("server2") - fqdn - semicolon)
options_group = Optional(options_server) & \
                      Optional(options_server2) \
                      # | had a bunch of other options commented out
options_clause = Keyword("options") - \
                     lbrack - \
                     options_group - \
                     rbrack - \
                     semicolon
statement = options_clause # | had a bunch of other clauses commented out
isc_style_syntax = statement


def parse_me(parse_element, test_data):

    greeting = parse_element.parseString(test_data, parseAll=True)
    greeting.pprint(indent=4)


if __name__ == '__main__':
    parse_me(options_server, "server example.com;")
    parse_me(options_server, "server example.com ;")
    parse_me(options_server, "server example.com\t;")
    parse_me(options_server, "server\texample.com;")
    parse_me(options_server, "server\texample.com ;")
    parse_me(options_server, "server\texample.com\t;")
    parse_me(options_server, "server     example.com    ;")
    parse_me(options_server, "server\t \texample.com \t ;")
    parse_me(options_server, "server\t\t\texample.com\t\t\t;")
    parse_me(statement, "options { server\t \texample.com \t;\n server2\t\t\t\t\t\t\t\t\t\t\t\t example2.net\t;\n}\n ;") 

Solution

  • The issue is:

    fqdn = originalTextFor(domain_name - \
                       originalTextFor(period - \
                                       domain_name) * (0, 16) - \
                       Optional(period))
    

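    originalTextFor returns the raw slice of the input that the wrapped expression covered. Because of the repetition and the trailing Optional bit, it seems that the expression keeps reading past the whitespace until it actually fails on the repetition, so the skipped whitespace ends up inside that slice. However, if you change this to use Combine (which must also be imported from pyparsing):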

    fqdn = Combine(domain_name - \
                       originalTextFor(period + \
                                       domain_name) * (0, 16) - \
                       Optional(period))
    

    Then your fqdn will contain only the non-whitespace chars.

    ParserElements also come with their own runTests method that makes it easier to write quick tests for multiple inputs:

    options_server.runTests("""
        server example.com;
        server example.com   ;
        server example.com   .z;
        server example.com.;
    """)
    

    would print:

    server example.com;
    [['server', 'example.com']]
    [0]:
      ['server', 'example.com']
    
    
    server example.com   ;
    [['server', 'example.com']]
    [0]:
      ['server', 'example.com']
    
    
    server example.com   .z;
                         ^(FATAL)
    FAIL: Expected ";" (at char 21), (line:1, col:22)
    
    
    server example.com.;
                       ^(FATAL)
    FAIL: Expected domain (at char 19), (line:1, col:20)
    

    (None of your tab test cases is really being exercised, since pyparsing by default expands tabs to spaces before it starts parsing. You must call expr.parseWithTabs() to disable this feature.)
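
To tie the Combine fix and the note about tabs together, here is a minimal sketch rather than a drop-in replacement for the full grammar. It assumes the repetition is written with `+` and without the inner originalTextFor, and it parses a few of the question's inputs both with the default tab expansion and with parseWithTabs(). The names `options_server_tabs`, `parse_server` and `samples` are made up for this illustration.

```python
from pyparsing import Combine, Group, Keyword, Literal, Optional, Word, srange

period = Literal(".")
semicolon = Literal(";").suppress()
domain_name = Word(srange("[0-9A-Za-z]"), min=1, max=63)

# Combine joins the matched pieces into one string and does not skip
# whitespace between them, so the token ends at the last name character.
fqdn = Combine(domain_name + (period + domain_name) * (0, 16) + Optional(period))
options_server = Group(Keyword("server") - fqdn - semicolon)

# parseWithTabs() changes the expression it is called on, so work on a copy
# and leave the original with the default tab-to-space expansion.
options_server_tabs = options_server.copy().parseWithTabs()


def parse_server(text, keep_tabs=False):
    """Hypothetical helper: parse one 'server' statement into a nested list."""
    parser = options_server_tabs if keep_tabs else options_server
    return parser.parseString(text, parseAll=True).asList()


# A few of the inputs from the question, with spaces and with real tabs.
samples = [
    "server example.com;",
    "server example.com   ;",
    "server\texample.com\t;",
    "server\t \texample.com \t ;",
]

for text in samples:
    print(repr(text), "->", parse_server(text), parse_server(text, keep_tabs=True))
```

Because Combine requires its pieces to be adjacent, an input such as "example .com" no longer counts as a single name, which is usually what you want for a domain name.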