Search code examples
parsinggrammarpyparsing

How to write grammar for an expression when it can have many possible forms


I have some sentences that I need to convert to regex code and I was trying to use Pyparsing for it. The sentences are basically search rules, telling us what to search for.

Examples of sentences -

  1. LINE_CONTAINS this is a phrase -this is an example search rule telling that the line you are searching on should have the phrase this is a phrase

  2. LINE_STARTSWITH However we - this is an example search rule telling that the line you are searching on should start with the phrase However we

  3. The rules can be combined too, like- LINE_CONTAINS phrase one BEFORE {phrase2 AND phrase3} AND LINE_STARTSWITH However we

A list of all actual sentences (if necessary) can be found here.
All lines start with either of the 2 symbols mentioned above (call them line_directives). Now, I am trying to parse these sentences and then convert them to regex code. I started writing a BNF for my grammar and this is what I came up with -

lpar ::= '{'
rpar ::= '}'
line_directive ::= LINE_CONTAINS | LINE_STARTSWITH
phrase ::= lpar(?) + (word+) + rpar(?) # meaning if a phrase is parenthesized, its still the same

upto_N_words ::= lpar + 'UPTO' + num + 'WORDS' + rpar
N_words ::= lpar + num + 'WORDS' + rpar
upto_N_characters ::= lpar + 'UPTO' + num + 'CHARACTERS' + rpar
N_characters ::= lpar + num + 'CHARACTERS' + rpar

JOIN_phrase ::= phrase + JOIN + phrase
AND_phrase ::= phrase (+ JOIN + phrase)+
OR_phrase ::= phrase (+ OR + phrase)+
BEFORE_phrase ::= phrase (+ BEFORE + phrase)+
AFTER_phrase ::= phrase (+ AFTER + phrase)+

braced_OR_phrase ::= lpar + OR_phrase + rpar
braced_AND_phrase ::= lpar + AND_phrase + rpar
braced_BEFORE_phrase ::= lpar + BEFORE_phrase + rpar
braced_AFTER_phrase ::= lpar + AFTER_phrase + rpar
braced_JOIN_phrase ::= lpar + JOIN_phrase + rpar

rule ::= line_directive + subrule
final_expr ::= rule (+ AND/OR + rule)+

The problem is the subrule, for which (based on the empirical data I have) I have been able to come up with all of the following expressions -

subrule ::= phrase
        ::= OR_phrase
        ::= JOIN_phrase
        ::= BEFORE_phrase
        ::= AFTER_phrase
        ::= AND_phrase
        ::= phrase + upto_N_words + phrase
        ::= braced_OR_phrase + phrase
        ::= phrase + braced_OR_phrase
        ::= phrase + braced_OR_phrase + phrase
        ::= phrase + upto_N_words + braced_OR_phrase
        ::= phrase + upto_N_characters + phrase
        ::= braced_OR_phrase + phrase + upto_N_words + phrase
        ::= phrase + braced_OR_phrase + upto_N_words + phrase

To give an example, one sentence I have is LINE_CONTAINS the objective of this study was {to identify OR identifying} genes upregulated. For this the subrule as mentioned above is phrase + braced_OR_phrase + phrase.

So my question is how do I write a simple BNF grammar expression for the subrule so that I would be able to easily code the grammar for it using Pyparsing? Also, any input regarding my present technique is absolutely welcome.


EDIT: After applying the principles elucidated by @Paul in his answer, here is the MCVE version of the code. It takes a list of sentences to be parsed hrrsents, parses each sentence, converts it to it's corresponding regex and returns a list of regex strings -

from pyparsing import *
import re


def parse_hrr(hrrsents):
    UPTO, AND, OR, WORDS, CHARACTERS = map(Literal, "UPTO AND OR WORDS CHARACTERS".split())
    LBRACE,RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()

    LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
        """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
    BEFORE, AFTER, JOIN = map(Literal, "BEFORE AFTER JOIN".split())
    keyword = UPTO | WORDS | AND | OR | BEFORE | AFTER | JOIN | LINE_CONTAINS | PARA_STARTSWITH

    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens

        def generate(self):
            pass

    class LiteralNode(Node):
        def generate(self):
            return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list

    class ConsecutivePhrases(Node):
        def generate(self):
            join_these=[]
            tokens = self.tokens[0]
            for t in tokens:
                tg = t.generate()
                join_these.append(tg)
            seq = []
            for word in join_these[:-1]:
                if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
                    seq.append(word + "")
                else:
                    seq.append(word + "\s+")
            seq.append(join_these[-1])
            result = "".join(seq)
            return result

    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these=[]
            for t in tokens[::2]:
                tg = t.generate()
                tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
                join_these.append(tg_mod)
            joined = ''.join(ele for ele in join_these)
            full = '('+ joined+')'
            return full

    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            joined = '|'.join(t.generate() for t in tokens[::2])
            full = '('+ joined+')'
            return full

    class LineTermNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            dir_phr_map = {
                'LINE_CONTAINS': lambda a:  r"((?:(?<=^)|(?<=[\W_]))" + a + r"(?=[\W_]|$))456", 
                'PARA_STARTSWITH':
                    lambda a: ( r"(^" + a + r"(?=[\W_]|$))457") if 'gene' in repr(a)
                    else (r"(^" + a + r"(?=[\W_]|$))458")}

            for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                ret = dir_phr_map[line_dir](phr_term.generate())
            return ret

    class LineAndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '&&&'.join(t.generate() for t in tokens[::2])

    class LineOrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            return '@@@'.join(t.generate() for t in tokens[::2])

    class UpToWordsNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            word_re = r"([\w]+\s*)"
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op contains the parsed "upto" expression
                ret += "(%s{0,%d})" % (word_re, op)
            return ret

    class UpToCharactersNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            char_re = r"\w"
            for op, operand in zip(tokens[1::2], tokens[2::2]):
                # op contains the parsed "upto" expression
                ret += "((%s){0,%d})" % (char_re, op)
            return ret

    class BeforeAfterJoinNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            operator_opn_map = {'BEFORE': lambda a,b: a + '.*?' + b, 'AFTER': lambda a,b: b + '.*?' + a, 'JOIN': lambda a,b: a + '[- ]?' + b}
            ret = tokens[0].generate()
            for operator, operand in zip(tokens[1::2], tokens[2::2]):
                ret = operator_opn_map[operator](ret, operand.generate()) # this is basically calling a dict element, and every such element requires 2 variables (a&b), so providing them as ret and op.generate
            return ret

## THE GRAMMAR
    word = ~keyword + Word(alphas, alphanums+'-_+/()')
    uptowords_expr = Group(LBRACE + UPTO + integer("numberofwords") + WORDS + RBRACE).setParseAction(UpToWordsNode)
    uptochars_expr = Group(LBRACE + UPTO + integer("numberofchars") + CHARACTERS + RBRACE).setParseAction(UpToCharactersNode)
    some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
    phrase_item = some_words | uptowords_expr | uptochars_expr

    phrase_expr = infixNotation(phrase_item,
                                [
                                ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT, BeforeAfterJoinNode), # was not working earlier, because BEFORE etc. were not keywords, and hence parsed as words
                                (None, 2, opAssoc.LEFT, ConsecutivePhrases),
                                (AND, 2, opAssoc.LEFT, AndNode),
                                (OR, 2, opAssoc.LEFT, OrNode),
                                ],
                                lpar=Suppress('{'), rpar=Suppress('}')
                                ) # structure of a single phrase with its operators

    line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
                      (phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase
    #
    line_contents_expr = infixNotation(line_term.setParseAction(LineTermNode),
                                       [(AND, 2, opAssoc.LEFT, LineAndNode),
                                        (OR, 2, opAssoc.LEFT, LineOrNode),
                                        ]
                                       ) # grammar for the entire rule/sentence
######################################
    mrrlist=[]
    for t in hrrsents:
        t = t.strip()
        if not t:
            continue
        try:
            parsed = line_contents_expr.parseString(t)
        except ParseException as pe:
            print(' '*pe.loc + '^')
            print(pe)
            continue


        temp_regex = parsed[0].generate()
        final_regexes3 = re.sub(r'gene','%s',temp_regex) # this can be made more precise by putting a condition of [non-word/^/$] around the 'gene'
        mrrlist.append(final_regexes3)
    return(mrrlist)

Solution

  • You have a two-tiered grammar here, so you would do best to focus on one tier at a time, which we have covered in some of your other questions. The lower tier is that of the phrase_expr, which will later serve as the argument to the line_directive_expr. So define examples of phrase expressions first - extract them from your list of complete statement samples. Your finished BNF for phrase_expr will have the lowest level of recursion look like:

    phrase_atom ::= <one or more types of terminal items, like words of characters 
                     or quoted strings, or *possibly* expressions of numbers of 
                     words or characters>  |  brace + phrase_expr + brace`
    

    (Some other questions: Is it possible to have multiple phrase_items one after another with no operator? What does that indicate? How should it be parsed? interpreted? Should this implied operation be its own level of precedence?)

    That will be sufficient to loop back the recursion for your phrase expression - you should not need any other braced_xxx element in your BNF. AND, OR, and JOIN are clearly binary operators - in normal operation precedence, AND's are evaluated before OR's, you can decide for yourself where JOIN should fall in this. Write some sample phrases with no parentheses, with AND and JOIN, and OR and JOIN, and think through what order of evaluation makes sense in your domain.

    Once that is done, then line_directive_expr should be simple, since it is just:

    line_directive_item ::= line_directive phrase_expr | brace line_directive_expr brace
    line_directive_and ::= line_directive_item (AND line_directive_item)*
    line_directive_or ::= line_directive_and (OR line_directive_and)*
    line_directive_expr ::= line_directive_or
    

    Then when you translate to pyparsing, add Groups and results names a little at a time! Don't immediately Group everything or name everything. Ordinarily I recommend using results names liberally, but in infix notation grammars, lots of results names can just clutter up the results. Let the Group (and ultimately node classes) do the structuring, and the behavior in the node classes will guide you where you want results names. For that matter, the results classes usually get such a simple structure that it is often easier just to do list unpacking in the class init or evaluate methods. Start with simple expressions and work up to complicated ones. (Look at "LINE_STARTSWITH gene" - it is one of your simplest test cases, but you have it as #97?) If you just sort this list by length order, that would be a good rough cut. Or sort by increasing number of operators. But tackling the complex cases before you have the simple ones working, you will have too many options on where a tweak or refinement should go, and (speaking from personal experience) you are as likely to get it wrong as get it right - except when you get it wrong, it just makes fixing the next issue more difficult.

    And again, as we have discussed elsewhere, the devil in this second tier is doing the actual interpretation of the various line directive items, since there is an implied order to evaluating LINE_STARTSWITH vs LINE_CONTAINS that overrides the order that they may be found in the initial string. That ball is entirely in your court, since you are the language designer for this particular domain.