Search code examples
pythonregexpython-itertools

More elegant way to implement regexp-like quantifiers


I'm writing a simple string parser which allows regexp-like quantifiers. An input string might look like this:

s = "x y{1,2} z"

My parser function translates this string to a list of tuples:

list_of_tuples = [("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]

Now, the tricky bit is that I need a list of all valid combinations that are specified by the quantification. The combinations all have to have the same number of elements, and the value None is used for padding. For the given example, the expected output is

[["x", "y", None, "z"], ["x", "y", "y", "z"]]

I do have a working solution, but I'm not really happy with it: it uses two nested for loops, and I find the code somewhat obscure, so there's something generally awkward and clumsy about it:

import itertools

def permute_input(lot):
    outer = []
    # is there something that replaces these nested loops?
    for val, start, end in lot:
        inner = []
        # For each tuple, create a list of constant length
        # Each element contains a different number of 
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        for i in range(start, end + 1):
            x = [val] * i + [None] * (end - i)
            inner.append(x)
        outer.append(inner)
    # Outer is now a list of lists.

    final = []
    # use itertools.product to combine the elements in the
    # list of lists:
    for combination in itertools.product(*outer):
        # flatten the elements in the current combination,
        # and append them to the final list:
        final.append([x for x 
                    in itertools.chain.from_iterable(combination)])
    return final

print(permute_input([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

I suspect that there's a much more elegant way of doing this, possibly hidden somewhere in the itertools module?


Solution

  • One alternative way to approach the problem is to use pyparsing and this example regex parser that would expand a regular expression to possible matching strings. For your x y{1,2} z sample string it would generate two possible strings expanding the quantifier:

    $ python -i regex_invert.py 
    >>> s = "x y{1,2} z"
    >>> for item in invert(s):
    ...     print(item)
    ... 
    x y z
    x yy z
    

    The repetition itself supports both an open-ended range and a closed range and is defined as:

    repetition = (
        (lbrace + Word(nums).setResultsName("count") + rbrace) |
        (lbrace + Word(nums).setResultsName("minCount") + "," + Word(nums).setResultsName("maxCount") + rbrace) |
        oneOf(list("*+?"))
    )
    

    To get to the desired result, we should modify the way the results are yielded from the recurseList generator and return lists instead of strings:

    for s in elist[0].makeGenerator()():
        for s2 in recurseList(elist[1:]):
            yield [s] + [s2]  # instead of yield s + s2
    

    Then, we need to only flatten the result:

    $ ipython3 -i regex_invert.py 
    
    In [1]: import collections
    
    In [2]: def flatten(l):
       ...:     for el in l:
       ...:         if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):
       ...:             yield from flatten(el)
       ...:         else:
       ...:             yield el
       ...:             
    
    In [3]: s = "x y{1,2} z"
    
    In [4]: for option in invert(s):
       ...:     print(list(flatten(option)))
       ...: 
    ['x', ' ', 'y', None, ' ', 'z']
    ['x', ' ', 'y', 'y', ' ', 'z']
    

    Then, if needed, you can filter the whitespace characters:

    In [5]: for option in invert(s):
       ...:     print([item for item in flatten(option) if item != ' '])
       ...:     
    ['x', 'y', None, 'z']
    ['x', 'y', 'y', 'z']