python json yacc lex ply

How to ignore tokens in ply.yacc


I'm writing an interpreter for a JSON configuration (i.e., a config file in JSON format) with PLY.

There are huge swaths of the configuration file that I'd like to ignore. Some parts that I'd like to ignore contain tokens that I can't ignore in other parts of the file.

For example, I want to ignore:

"features" : [{
    "name" : "someObscureFeature",
    "version": "1.2",
    "options": {
      "values" : ["a", "b", "c"],
      "allowWithoutContentLength": false,
      "enabled": true
    }
    ...
}]

But I do NOT want to ignore:

"features" : [{
    "name" : "importantFeature",        
    "version": "1.1",
    "options": {
      "value": {
        "id": 587842,
        "description": "ramy-single-hostmatch",
        "products": [
          "Fresca"
        ]
    ...
}]

There are also lots of other tokens within the array of features that I want to ignore if the name value is not 'importantFeature'. For example there is likely to be an array of values in both important and obscure features. I need to ignore accordingly.

Notice also that I need to extract certain elements of the values field, so I'd like that field to be tokenized so I can make use of it. Effectively, I'd like to conditionally tokenize the values field only when it's inside an importantFeature block.

Also note that importantFeature is just standing in for what will eventually be about a dozen different features, each with its own grammar inside its respective features block.

The problem I'm running into is that every feature, obviously, has a name. I'd like to write something along these lines:

def p_FEATURES(p):
    '''FEATURES : ARRAY_START FEATURE COMMA FEATURES ARRAY_END
                | ARRAY_START FEATURE ARRAY_END'''

def p_FEATURE(p):
    '''FEATURE : TESTABLE_FEATURE
               | UNTESTABLE_FEATURE'''

def p_TESTABLE_FEATURE(p):
    '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''

def p_UNTESTABLE_FEATURE(p):
    '''UNTESTABLE_FEATURE : IGNORE_BLOCK '''

def p_IGNORE_BLOCK(p):
    '''IGNORE_BLOCK : BLOCK_START LINES BLOCK_END'''

However, the problem I'm running into is that I can't just use "IGNORE_BLOCK", because the block will have a 'name' and I have a token in my lexer called 'name':

def t_NAME_KEY(t): r'name'; return t

Any help greatly appreciated.


Solution

  • When you define a token rule as a function, you decide whether or not to return the token. A returned token is passed on to the parser; if the function returns nothing, the match is discarded. For example:

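    def t_BLOCK(t):
        r'\{\s*name\s*:\s*(importantFeature|obscureFeature)\}'  # matches a whole block made up of the 'name' key
        # t is a LexToken; the matched text lives in t.value.
        if 'obscureFeature' not in t.value:
            return t
        # Falling through returns None, so the lexer discards the match.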
    

    You can build a rule somewhat along these lines, and then choose whether to return the token or not based on whether your important feature was present or not.

    Also, PLY's convention for a rule written as a string whose matches should be discarded is to prefix the name with t_ignore_ (for example, t_ignore_COMMENT).
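Both ways of discarding input can be sketched in one small lexer. This is only an illustration, not the OP's lexer: the token set, the comment rule, the "obscure" prefix check, and the token_types helper are all assumptions made up for the example.

```python
import ply.lex as lex

# Hypothetical, minimal token set for this illustration.
tokens = ("NAME_KEY", "STRING", "COLON")

# String rule with the ignore_ prefix: whatever it matches is discarded.
t_ignore_COMMENT = r'\#[^\n]*'

# Characters skipped between tokens (this is not a regex).
t_ignore = " \t"

t_COLON = r':'

def t_NAME_KEY(t):
    r'"name"'
    return t

def t_STRING(t):
    r'"[^"\n]*"'
    t.value = t.value[1:-1]  # strip the surrounding quotes
    # Hypothetical filter: drop any string starting with "obscure".
    if t.value.startswith("obscure"):
        return None  # returning None makes the lexer discard the token
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)  # no return, so newlines are discarded

def t_error(t):
    t.lexer.skip(1)  # skip characters no rule matches

lexer = lex.lex()

def token_types(text):
    """Return (type, value) pairs for every token the lexer lets through."""
    lexer.input(text)
    return [(tok.type, tok.value) for tok in lexer]
```

Function rules are tried in the order they are defined, which is why t_NAME_KEY sits above the more general t_STRING. Note that a filter like this sees one token at a time, so it cannot know which block a token belongs to.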


    Based on OP's edit: forget about eliminating tokens during tokenisation. What you could do instead is manually rebuild the JSON as you parse it with the grammar. For example:

    Replace

    def p_FEATURE(p):
        '''FEATURE : TESTABLE_FEATURE
                   | UNTESTABLE_FEATURE'''
    
    def p_TESTABLE_FEATURE(p):
        '''TESTABLE_FEATURE : BLOCK_START QUOTE NAME_KEY QUOTE COLON QUOTE CPCODE_FEATURE QUOTE COMMA IGNORE_KEY_VAL_PAIRS COMMA CPCODE_OPTIONS COMMA IGNORE_KEY_VAL_PAIRS'''
    
    def p_UNTESTABLE_FEATURE(p):
        '''UNTESTABLE_FEATURE : IGNORE_BLOCK '''
    

    with

    data = []
    
    def p_FEATURE(p):
        '''FEATURE : BLOCK_START DATA BLOCK_END FEATURE 
                   | BLOCK_START DATA BLOCK_END'''
    
    def p_DATA(p):
        '''DATA : KEY COLON VALUE COMMA DATA 
                | KEY COLON VALUE ''' # and so on (have another function for values)
    

    What you can do now is examine p[2] in p_FEATURE (provided p_DATA assigns what it collected to p[0]) and see whether the feature is important. If it is, add it to your data variable; otherwise, ignore it.

    This is just a rough idea. You'll still have to figure out the grammar rules exactly (for example, VALUE would also probably lead to another state), and adding the right blocks to data and how. But it is possible.
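One way the rough idea above might be fleshed out is sketched below. It is not the OP's grammar: the token names, the rule names, the IMPORTANT set, and the parse_features helper are assumptions for illustration. Every block is parsed the same way into a Python dict, and the filtering happens in the top-level rule. Empty objects, empty arrays, null, and string escapes are left out to keep it short, and the yacc.yacc keywords assume PLY 3.11 as published on PyPI.

```python
import ply.lex as lex
import ply.yacc as yacc

# Hypothetical: the feature names whose blocks should be kept.
IMPORTANT = {"importantFeature"}

tokens = ("STRING", "NUMBER", "TRUE", "FALSE", "COLON", "COMMA",
          "BLOCK_START", "BLOCK_END", "ARRAY_START", "ARRAY_END")

t_COLON = r':'
t_COMMA = r','
t_BLOCK_START = r'\{'
t_BLOCK_END = r'\}'
t_ARRAY_START = r'\['
t_ARRAY_END = r'\]'
t_ignore = " \t\r\n"  # skip all whitespace between tokens

def t_STRING(t):
    r'"[^"\\\n]*"'
    t.value = t.value[1:-1]  # strip quotes; escape sequences are not handled
    return t

def t_NUMBER(t):
    r'-?\d+(\.\d+)?'
    t.value = float(t.value) if "." in t.value else int(t.value)
    return t

def t_TRUE(t):
    r'true'
    t.value = True
    return t

def t_FALSE(t):
    r'false'
    t.value = False
    return t

def t_error(t):
    raise SyntaxError("illegal character %r" % t.value[0])

def p_features(p):
    '''features : ARRAY_START blocks ARRAY_END'''
    # Every block was parsed uniformly; unimportant ones are dropped here.
    p[0] = [b for b in p[2] if b.get("name") in IMPORTANT]

def p_blocks(p):
    '''blocks : block
              | blocks COMMA block'''
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

def p_block(p):
    '''block : BLOCK_START pairs BLOCK_END'''
    p[0] = dict(p[2])

def p_pairs(p):
    '''pairs : pair
             | pairs COMMA pair'''
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

def p_pair(p):
    '''pair : STRING COLON value'''
    p[0] = (p[1], p[3])

def p_value(p):
    '''value : STRING
             | NUMBER
             | TRUE
             | FALSE
             | block'''
    p[0] = p[1]

def p_value_array(p):
    '''value : ARRAY_START values ARRAY_END'''
    p[0] = p[2]

def p_values(p):
    '''values : value
              | values COMMA value'''
    p[0] = [p[1]] if len(p) == 2 else p[1] + [p[3]]

def p_error(p):
    raise SyntaxError("unexpected %s" % (p.type if p else "end of input"))

lexer = lex.lex()
parser = yacc.yacc(debug=False, write_tables=False)  # PLY 3.11 keywords

def parse_features(text):
    """Parse a features array and return only the important blocks."""
    return parser.parse(text, lexer=lexer)
```

Because "name" is lexed as an ordinary STRING here, there is no clash between a NAME_KEY token and the blocks to be ignored. The feature-specific handling (for instance, a separate function per important feature) can then operate on the returned dicts rather than on the token stream.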