PyParsing: Conditional parsing, depending on value


I need to parse Touchstone files (version 1.1 and version 2.0), but these have a strange rule in the syntax (see page 11 of the 1.1 spec, the top paragraph starting with "Note").

So, I need to change the syntax rule from 'data points' to 'noise parameters', depending on the first float of the line. Like in this example:

! NETWORK PARAMETERS
2 .95 -26 3.57 157 .04 76 .66 -14
22 .60 -144 1.30 40 .14 40 .56 -85
! NOISE PARAMETERS (the downward jump from 22, on the line above, to 4 below should trigger the change of syntax)
4 .7 .64 69 .38
18 2.7 .46 -33 .40

(The lines starting with ! are comments and are optional)

There is no other parameter in the data file to help. (This only occurs in the 'old' 1.x versions of the spec. In version 2.0, which still has to be compatible with 1.*, a keyword was introduced.)

How can I implement this in a single grammar? (I suspect the only solution is a line-by-line parser?)


Solution

  • This is probably a good case for using a parse action to detect when a new group of lines is found. It is possible to make a parser that dynamically redefines itself, but that is unnecessarily complicated here. For this case, we'll write a parser that reads all the lines of values, and then regroups them based on the rule "the first value starts a new group if it is less than or equal to the previous line's first value".

    First we need a parser that parses all the lines. Since line endings are going to be significant in this parser, we'll have to redefine the default whitespace characters at the start (and define an NL expression that we can insert in the parser, since we'll have to explicitly parse them now):

    import pyparsing as pp
    
    pp.ParserElement.set_default_whitespace_chars(" ")
    NL = pp.LineEnd().suppress()
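
To see why the whitespace change matters, here is a small standalone sketch (the two-line `text` and the variable names are made up for illustration). With pyparsing's default whitespace, a repetition runs straight across the newline; once only the space character is skippable, it stops at the end of the line:

```python
import pyparsing as pp

text = "1 2\n3 4"

# Created while the default whitespace still includes "\n",
# so the repetition silently continues onto the second line.
default_numbers = pp.Word(pp.nums)[1, ...]
default_result = default_numbers.parse_string(text).as_list()
print(default_result)  # ['1', '2', '3', '4']

# From here on, newly created expressions skip only spaces.
pp.ParserElement.set_default_whitespace_chars(" ")

# Same expression, created after the change: it stops at the newline.
line_numbers = pp.Word(pp.nums)[1, ...]
line_result = line_numbers.parse_string(text).as_list()
print(line_result)  # ['1', '2']
```

The setting only affects expressions created after the call, which is why it goes at the very top. Note that with just " " in the set, tabs are no longer skipped either; if your files separate values with tabs, " \t" is probably the safer choice.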
    

    I wanted to just use pyparsing's numeric string matcher/converter defined in pp.common.fnumber, but it does not accept floats that start with ".". So we define a Regex that suits your numeric values, and add a converter to convert to ints or floats at parse time:

    def str_to_num(s):
        """Function to convert str to int or float."""
        try:
            return int(s)
        except ValueError:
            return float(s)
    
    value = pp.Regex(r"-?(\d+(\.[0-9]+)?|\.[0-9]+)")
    # use parse action to convert numeric strings to int or float values
    value.add_parse_action(lambda t: str_to_num(t[0]))
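
As a quick check of the conversion, the sketch below runs `value` on a few strings taken from your sample. It also shows a hypothetical `sci_value` variant (my assumption, not something your sample requires) that additionally accepts a leading "+" and exponent notation, in case your files contain values such as 1e9:

```python
import pyparsing as pp


def str_to_num(s):
    """Convert a numeric string to int if possible, else float."""
    try:
        return int(s)
    except ValueError:
        return float(s)


# The expression from above: optional "-", then digits with an
# optional fraction, or a fraction with no leading digit.
value = pp.Regex(r"-?(\d+(\.[0-9]+)?|\.[0-9]+)")
value.add_parse_action(lambda t: str_to_num(t[0]))

converted = [value.parse_string(s)[0] for s in ("22", "-26", ".95", "1.30")]
print(converted)  # [22, -26, 0.95, 1.3]

# Hypothetical extension: optional sign, optional exponent.
sci_value = pp.Regex(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")
sci_value.add_parse_action(lambda t: str_to_num(t[0]))

sci_converted = [sci_value.parse_string(s)[0] for s in ("1e9", "+2.5E-3", ".5")]
print(sci_converted)  # [1000000000.0, 0.0025, 0.5]
```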
    

    With these pieces in place, we can define the parser for these lines, using Group to keep each line's values separate, and ignore your comments (I'm also using the relatively new [...] and [1, ...] notation in place of ZeroOrMore and OneOrMore):

    data_line = pp.Group(value[1, ...]) + NL
    parameters = data_line[...]
    
    comment = "!" + pp.rest_of_line + NL
    data_line.ignore(comment)
    

    At this point, if we use parameters to parse your input, we get a single list of the lines of values, each line in a sub-group:

    [
      [2, 0.95, -26, 3.57, 157, 0.04, 76, 0.66, -14],
      [22, 0.6, -144, 1.3, 40, 0.14, 40, 0.56, -85],
      [4, 0.7, 0.64, 69, 0.38],
      [18, 2.7, 0.46, -33, 0.4],
    ]  
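
To reproduce that list end to end, the pieces so far can be assembled into one script. The `sample` string here is the data from your question with the comment text shortened; everything else is the code shown above:

```python
import pyparsing as pp

# Newlines are significant, so only spaces are skipped automatically.
pp.ParserElement.set_default_whitespace_chars(" ")
NL = pp.LineEnd().suppress()


def str_to_num(s):
    """Convert a numeric string to int if possible, else float."""
    try:
        return int(s)
    except ValueError:
        return float(s)


value = pp.Regex(r"-?(\d+(\.[0-9]+)?|\.[0-9]+)")
value.add_parse_action(lambda t: str_to_num(t[0]))

# One Group per line of values, zero or more lines in total.
data_line = pp.Group(value[1, ...]) + NL
parameters = data_line[...]

# Comment lines start with "!" and run to the end of the line.
comment = "!" + pp.rest_of_line + NL
data_line.ignore(comment)

sample = """\
! NETWORK PARAMETERS
2 .95 -26 3.57 157 .04 76 .66 -14
22 .60 -144 1.30 40 .14 40 .56 -85
! NOISE PARAMETERS
4 .7 .64 69 .38
18 2.7 .46 -33 .40
"""

parsed = parameters.parse_string(sample, parse_all=True).as_list()
for row in parsed:
    print(row)
```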
    

    Note that the values are no longer strings; the parse action has already converted them to ints or floats.

    Here is a railroad diagram for that parser, created using the following added lines:

    pp.autoname_elements()
    parameters.create_diagram("diagram.html")
    

    [Image: railroad diagram of the parser]

    To perform the regrouping, we'll add another parse action, this time on the parameters expression:

    def regroup_parameters(tokens):
        """Parse action to group parsed lines into named groups, detecting
        group breaks when the first value on a line is less than or equal
        to the first value of the previous line."""
    
        # keys are defined in the order they are expected to be found
        # in the inputs
        ret = {
            "network": [],
            "noise": [],
        }
    
        # start with infinity here, so that the first line always
        # starts a new group, whatever its first value is
        last_initial = float("inf")
    
        # this iterator will cycle through the group names in the
        # ret dict as new groups are found
        keys_iter = iter(ret)
    
        # assign each parsed line of values to the current group,
        # or a new group if a group break is detected (based on
        # the first value of the line)
        for line in tokens:
            if line[0] <= last_initial:
                # new group detected, advance to next key
                ret_key = next(keys_iter)
            ret[ret_key].append(list(line))
            last_initial = line[0]
    
        # construct a new ParseResults from the dict
        return pp.ParseResults.from_dict(ret)
    
    
    parameters.add_parse_action(regroup_parameters)
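
The group-break rule can also be exercised on plain lists, without pyparsing, which makes it easy to test in isolation. The sketch below uses a hypothetical `regroup_rows` helper (my own name, not part of pyparsing). It follows the same logic as the parse action, with one difference: where `next(keys_iter)` in the parse action would raise `StopIteration` if the input had more group breaks than keys, this version raises a `ValueError` instead, which is just one possible way to flag malformed input:

```python
def regroup_rows(rows, keys=("network", "noise")):
    """Split rows into named groups; a new group starts whenever a
    row's first value is <= the previous row's first value."""
    groups = {key: [] for key in keys}
    keys_iter = iter(keys)
    current_key = None
    previous_first = float("inf")  # the first row always starts a group

    for row in rows:
        if row[0] <= previous_first:
            try:
                current_key = next(keys_iter)
            except StopIteration:
                # more group breaks than expected group names
                raise ValueError(
                    f"unexpected group break at row {list(row)}"
                ) from None
        groups[current_key].append(list(row))
        previous_first = row[0]

    return groups


# Shortened rows, keeping only the first few values of each line
rows = [
    [2, 0.95, -26],
    [22, 0.6, -144],
    [4, 0.7, 0.64],
    [18, 2.7, 0.46],
]
groups = regroup_rows(rows)
print(groups["network"])  # [[2, 0.95, -26], [22, 0.6, -144]]
print(groups["noise"])    # [[4, 0.7, 0.64], [18, 2.7, 0.46]]
```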
    

    Now if we parse using this parser (with sample holding the input text from the question):

    result = parameters.parse_string(sample, parse_all=True)
    print(result.dump())
    

    we get:

    [[[2, 0.95, -26, 3.57, 157, 0.04, 76, 0.66, -14], ...
    - network: [
        [2, 0.95, -26, 3.57, 157, 0.04, 76, 0.66, -14], 
        [22, 0.6, -144, 1.3, 40, 0.14, 40, 0.56, -85]
        ]
    - noise: [
        [4, 0.7, 0.64, 69, 0.38], 
        [18, 2.7, 0.46, -33, 0.4]
        ]
    

    And you can access the fields directly:

    print(result.network)
    
    # [[2, 0.95, -26, 3.57, 157, 0.04, 76, 0.66, -14], [22, 0.6, -144, 1.3, 40, 0.14, 40, 0.56, -85]]