
Parse only relevant parts of block via PyParsing


I am using PyParsing to parse files that look like the following:

abc_module (
   .test_test1                       ( test_test_real1              ),
   .abc_test                         ( abc_test_real                ),
   .obs_test_one                     ( obs_test_one_real            ),
   .obs_test_two                     ( obs_test_two_real            )
);

cfg_module (
   .test_test2                       ( test_test_real2              ),
   .xyz_test                         ( xyz_test_real                )
);

xyz_module (
   .test_test2                       ( test_test_real2              ),
   .xyz_test                         ( xyz_test_real                ),
   .obs_test_three                   ( obs_test_three_real          ),
   .ahc_test                         ( ahc_test_real                ),
   .obs_test_four                    ( obs_test_four_real           )
);

I'm trying to pull out the module name of each codeblock in combination with optional lines that start with ".obs", so my desired ParseResult for this example file would be:

ParseResult = [["abc_module", ".obs_test_one", ".obs_test_two"], ["cfg_module"], ["xyz_module", ".obs_test_three", ".obs_test_four"]] 

So far I have managed to parse the code blocks, but I am unable to pull out only the .obs parts. I tried using PyParsing's "skipTo" method, but I don't think that is the right tool for this.

My code so far:

from pyparsing import Word, alphanums, Suppress, OneOrMore, Optional

keyword = Word(alphanums + '_' + ".")
sub_module_parser = keyword.setResultsName("module_name") + Suppress("(") + OneOrMore(
    keyword + Suppress("(" + keyword + ")" + Optional(","))).setResultsName("obs_param")

Solution

  • tl;dr - Best to not try to write a parser in one line, but step back and write up a plan first.

    Often I'll write the BNF using some kind of semi-formal notation, but I'm going to try something a little different. The BNF doesn't have to be super-rigorous, as long as it spells out your thinking about the structure of the text.

    This is a very clean structure with no recursive nesting, so it really lends itself to taking a few minutes to write down a plan for how the pyparsing code will look before actually writing any code. A plan also helps you write the parser in logical groups. Writing the whole parser in one line has no real benefit, and just makes it harder to understand and maintain later.

    To start, look for some logical, repetitive groups in your text. For instance, your submodules look like:

    (a name starting with a dot) "(" (an identifier) ")"
    

    Each module contains a list of these, separated by commas.

    The module itself looks like this:

    (an identifier) "(" (some submodules separated by commas) ")" ";" 
    

    Thinking in structures like this helps avoid some unhelpful missteps, such as including the delimiting commas as part of the list item.

    So this is the BNF we'll work from:

    module IS (an identifier) "(" (some submodules separated by commas) ")" ";" 
    submodule IS (a name starting with a dot) "(" (an identifier) ")" 
    

    To translate to pyparsing, we'll need to start with the pieces, and eventually build back up to the overall module.

    First the punctuation. It's useful at parse time, but just clutter afterwards, so we suppress them from the parsed results:

    LPAR = Suppress("(")
    RPAR = Suppress(")")
    SEMI = Suppress(";")
    

    Next the identifiers. A common mistake when defining things like identifiers is just defining a Word with all the possible letters that could be found, such as Word(alphanums + '_'). This will match all the identifiers in your sample text, but would also match "___", "123", and "2_3_4", which I don't think you want to match. Word has a two-argument form to define the characters that are valid starting characters, and then the valid body characters. Let's say your identifiers have to start with an alpha, followed by any of the other characters. This will at least filter out integers or identifiers with no alphas:

    identifier = Word(alphas, alphanums+"_")
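    To see the difference between the one-argument and two-argument forms of Word, here is a small sketch. The names loose_identifier, strict_identifier, and the matches helper are made up for this illustration and are not part of the parser being built:

```python
from pyparsing import Word, alphas, alphanums, ParseException

# One-argument form: any mix of these characters is accepted
loose_identifier = Word(alphanums + "_")
# Two-argument form: first character must be a letter, the rest may be alphanumeric or "_"
strict_identifier = Word(alphas, alphanums + "_")


def matches(expr, text):
    """Return True if expr matches the whole of text (hypothetical helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


for candidate in ["abc_module", "___", "123", "2_3_4"]:
    print(candidate, matches(loose_identifier, candidate), matches(strict_identifier, candidate))
```

    The loose form accepts all four candidates, while the strict form accepts only "abc_module".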
    

    I'm going to be lazy and use this trick to define the submodule identifier as:

    submodule_identifier = Word(".", alphanums+"_")
    

    See how this will work for the "name starting with a dot" for submodules.

    We can now define a submodule expression:

    submodule = submodule_identifier + LPAR + identifier + RPAR
    

    And module now is:

    module = identifier + LPAR + delimited_list(submodule) + RPAR + SEMI
    

    Is it possible to have an empty module? If so, then the list of submodules should be wrapped in an Optional.

    module = identifier + LPAR + Optional(delimited_list(submodule)) + RPAR + SEMI
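    A quick way to check the effect of the Optional is to feed both variants an empty module. This is only a sketch: the names strict_module and lenient_module and the "empty_module" input are invented for the comparison, and delimited_list assumes pyparsing 3:

```python
from pyparsing import (Word, alphas, alphanums, Suppress, Optional,
                       ParseException, delimited_list)

LPAR = Suppress("(")
RPAR = Suppress(")")
SEMI = Suppress(";")

identifier = Word(alphas, alphanums + "_")
submodule_identifier = Word(".", alphanums + "_")
submodule = submodule_identifier + LPAR + identifier + RPAR

# Without Optional, at least one submodule is required
strict_module = identifier + LPAR + delimited_list(submodule) + RPAR + SEMI
# With Optional, an empty pair of parentheses is accepted too
lenient_module = identifier + LPAR + Optional(delimited_list(submodule)) + RPAR + SEMI

empty = "empty_module ( );"  # hypothetical input, not from the sample file

try:
    strict_module.parseString(empty)
    strict_ok = True
except ParseException:
    strict_ok = False

print(strict_ok)
print(lenient_module.parseString(empty).asList())
```

    The strict version raises a ParseException on the empty module, while the lenient version returns just the module name.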
    

    That pretty much does it. Here is the full parser, with added Groups and names to help in working with the parsed results:

    # delimited_list requires pyparsing 3; ParseResults is used by the
    # filtering parse action further down
    from pyparsing import (Word, alphas, alphanums, Suppress, Group, Optional,
                           ParseResults, delimited_list)

    LPAR = Suppress("(")
    RPAR = Suppress(")")
    SEMI = Suppress(";")
    
    identifier = Word(alphas, alphanums + "_").setName("identifier")
    submodule_identifier = Word(".", alphanums + "_").setName("submodule_identifier")
    submodule = Group(submodule_identifier("name") + LPAR + identifier("value") + RPAR).setName("submodule")
    submodule_list = delimited_list(submodule).setName("submodule_list")
    module = Group(identifier("name") + LPAR + Group(Optional(submodule_list))("submodules") + RPAR + SEMI).setName("module")
    

    Here is a diagram for this parser using the new railroad diagrams feature in pyparsing 3.

    module.create_diagram("module_railroad_diag.html")
    

    [Figure: module parser railroad diagram]

    And your parser will output this structure:

    print(module[...].parseString(sample).dump())
    
    [['abc_module', [['.test_test1', 'test_test_real1'], ['.abc_test', 'abc_test_real'], ...
    [0]:
      ['abc_module', [['.test_test1', 'test_test_real1'], ['.abc_test', 'abc_test_real'], ...
      - name: 'abc_module'
      - submodules: [['.test_test1', 'test_test_real1'], ['.abc_test', 'abc_test_real'], ...
        [0]:
          ['.test_test1', 'test_test_real1']
          - name: '.test_test1'
          - value: 'test_test_real1'
        [1]:
          ['.abc_test', 'abc_test_real']
          - name: '.abc_test'
          - value: 'abc_test_real'
        [2]:
          ['.obs_test_one', 'obs_test_one_real']
          - name: '.obs_test_one'
          - value: 'obs_test_one_real'
        [3]:
          ['.obs_test_two', 'obs_test_two_real']
          - name: '.obs_test_two'
          - value: 'obs_test_two_real'
    [1]:
      ['cfg_module', [['.test_test2', 'test_test_real2'], ['.xyz_test', 'xyz_test_real']]]
      - name: 'cfg_module'
      - submodules: [['.test_test2', 'test_test_real2'], ['.xyz_test', 'xyz_test_real']]
        [0]:
          ['.test_test2', 'test_test_real2']
          - name: '.test_test2'
          - value: 'test_test_real2'
        [1]:
          ['.xyz_test', 'xyz_test_real']
          - name: '.xyz_test'
          - value: 'xyz_test_real'
    [2]:
      ['xyz_module', [['.test_test2', 'test_test_real2'], ['.xyz_test', 'xyz_test_real'], ...
      - name: 'xyz_module'
      - submodules: [['.test_test2', 'test_test_real2'], ['.xyz_test', 'xyz_test_real'], ...
        [0]:
          ['.test_test2', 'test_test_real2']
          - name: '.test_test2'
          - value: 'test_test_real2'
        [1]:
          ['.xyz_test', 'xyz_test_real']
          - name: '.xyz_test'
          - value: 'xyz_test_real'
        [2]:
          ['.obs_test_three', 'obs_test_three_real']
          - name: '.obs_test_three'
          - value: 'obs_test_three_real'
        [3]:
          ['.ahc_test', 'ahc_test_real']
          - name: '.ahc_test'
          - value: 'ahc_test_real'
        [4]:
          ['.obs_test_four', 'obs_test_four_real']
          - name: '.obs_test_four'
          - value: 'obs_test_four_real'
    

    Lastly, you wanted to only show those submodules that start with ".obs". This is best done using a parse action, added to the submodule list to filter out just the ones you want:

    def obs_items_only(t):
        return ParseResults(item for item in t if item.name.startswith(".obs"))
    
    submodule_list.addParseAction(obs_items_only)
    

    After adding this filtering parse action, the results trim down to:

    [['abc_module', [['.obs_test_one', 'obs_test_one_real'], ['.obs_test_two', 'obs_test_two_real']]], ...
    [0]:
      ['abc_module', [['.obs_test_one', 'obs_test_one_real'], ['.obs_test_two', 'obs_test_two_real']]]
      - name: 'abc_module'
      - submodules: [['.obs_test_one', 'obs_test_one_real'], ['.obs_test_two', 'obs_test_two_real']]
        [0]:
          ['.obs_test_one', 'obs_test_one_real']
          - name: '.obs_test_one'
          - value: 'obs_test_one_real'
        [1]:
          ['.obs_test_two', 'obs_test_two_real']
          - name: '.obs_test_two'
          - value: 'obs_test_two_real'
    [1]:
      ['cfg_module', []]
      - name: 'cfg_module'
      - submodules: []
    [2]:
      ['xyz_module', [['.obs_test_three', 'obs_test_three_real'], ['.obs_test_four', 'obs_test_four_real']]]
      - name: 'xyz_module'
      - submodules: [['.obs_test_three', 'obs_test_three_real'], ['.obs_test_four', 'obs_test_four_real']]
        [0]:
          ['.obs_test_three', 'obs_test_three_real']
          - name: '.obs_test_three'
          - value: 'obs_test_three_real'
        [1]:
          ['.obs_test_four', 'obs_test_four_real']
          - name: '.obs_test_four'
          - value: 'obs_test_four_real'
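    The question asked for flat lists of the form [module name, .obs names...]. One possible way to get there from the grouped results is a list comprehension over the named fields. This is a sketch using a shortened version of the sample text; the variable names parsed and flattened are mine:

```python
from pyparsing import (Word, alphas, alphanums, Suppress, Group, Optional,
                       ParseResults, delimited_list)

# Shortened version of the sample input from the question
sample = """
abc_module (
   .test_test1   ( test_test_real1   ),
   .obs_test_one ( obs_test_one_real ),
   .obs_test_two ( obs_test_two_real )
);

cfg_module (
   .test_test2 ( test_test_real2 ),
   .xyz_test   ( xyz_test_real   )
);
"""

LPAR = Suppress("(")
RPAR = Suppress(")")
SEMI = Suppress(";")

identifier = Word(alphas, alphanums + "_")
submodule_identifier = Word(".", alphanums + "_")
submodule = Group(submodule_identifier("name") + LPAR + identifier("value") + RPAR)
submodule_list = delimited_list(submodule)
module = Group(identifier("name") + LPAR
               + Group(Optional(submodule_list))("submodules") + RPAR + SEMI)


def obs_items_only(t):
    # Keep only the submodule groups whose name starts with ".obs"
    return ParseResults([item for item in t if item.name.startswith(".obs")])


submodule_list.addParseAction(obs_items_only)

parsed = module[...].parseString(sample)

# One flat list per module: the module name followed by the surviving submodule names
flattened = [[mod.name] + [sub.name for sub in mod.submodules] for mod in parsed]
print(flattened)
```

    For this shortened sample, the result is one list per module, with cfg_module reduced to just its name because none of its submodules start with ".obs".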
    

    Exercise for the OP: with this parser, how would you add support for a submodule value that could also be an integer, which you could define as Word(nums)?
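    For readers who want to check their answer, here is one possible sketch of that exercise, not the only way to do it. The names integer and submodule_value and the ".obs_count" input are assumptions introduced for illustration:

```python
from pyparsing import Word, alphas, alphanums, nums, Suppress, Group

LPAR = Suppress("(")
RPAR = Suppress(")")

identifier = Word(alphas, alphanums + "_")
integer = Word(nums)
submodule_identifier = Word(".", alphanums + "_")

# A submodule value is now either an identifier or an integer
submodule_value = identifier | integer
submodule = Group(submodule_identifier("name") + LPAR + submodule_value("value") + RPAR)

with_int = submodule.parseString(".obs_count ( 42 )")
with_name = submodule.parseString(".obs_test_one ( obs_test_one_real )")
print(with_int.asList())
print(with_name.asList())
```

    Because the change is confined to the value expression, the rest of the parser (submodule_list, module, and the filtering parse action) can stay as it is. Note that the integer is returned as the string '42'; converting it to an int would need an additional parse action.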