I have a series of conditional expressions as strings like: "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
which are nested to an unpredictable depth. I have no control over this data; I need to be able to work with it.
I'd like Python to be able to understand these structures so that I can manipulate them.
I've tried creating nested Python lists by breaking at the parentheses (using pyparsing or unutbu's/falsetru's parser):
>>> str1="('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
>>> var1=parse_nested(str1, r'\s*\(', r'\)\s*')
>>> var1
[["'a' AND", ["'b' OR 'c'"]], 'OR', ["'d' AND", ["'e' OR", ["'g' AND 'h'"]]]]
...But then, to be honest, I don't know how to get Python to interpret the AND/OR relationships between the objects. I feel like I'm going in completely the wrong direction with this.
My ideal output would be a data structure that maintains the relationship types between the entities in a nested way, so that (for example) it would be easy to create JSON or YAML:
OR:
- AND:
  - a
  - OR:
    - b
    - c
- AND:
  - d
  - OR:
    - e
    - AND:
      - g
      - h
Assuming the quoted terms are syntactically valid Python strings, it should be possible to use the tokenize module to parse the input. The only additional requirement is that all embedded escapes are doubled. Also note that the string tokens returned by the tokeniser are always quoted, so they can be safely eval'd.
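To see what the tokeniser produces, it can help to dump the token stream for a small sample input first. A quick inspection snippet (not part of the parser itself):

import tokenize, token
from io import StringIO

# Show the exact type and text of every token in a sample expression.
for tok in tokenize.generate_tokens(StringIO("('a' AND 42)").readline):
    print(token.tok_name[tok.exact_type], repr(tok.string))

For this sample it prints LPAR, STRING, NAME, NUMBER and RPAR entries, plus trailing NEWLINE and ENDMARKER tokens - which is why the script below skips the token types collected in OTHER.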
The demo script below provides a basic implementation. The output is a list of lists, with AND/OR converted to singleton types - but if you need a different structure, it should be simple enough to adapt the code accordingly:
import tokenize, token as tokens
from io import StringIO

class ParseError(Exception):
    def __init__(self, message, token):
        super().__init__(f'{message}: {token}')

# Singleton marker objects used to represent the two operators.
AND = type('AND', (object,), {'__repr__': lambda x: 'AND'})()
OR = type('OR', (object,), {'__repr__': lambda x: 'OR'})()

# Names are matched case-insensitively, so 'or' and 'true' also work.
NAMES = {'AND': AND, 'OR': OR, 'TRUE': True, 'FALSE': False}

# Token types that carry no meaning here and are silently skipped.
OTHER = {
    tokens.ENDMARKER, tokens.INDENT, tokens.DEDENT,
    tokens.NEWLINE, tokens.NL, tokens.COMMENT,
    }

def parse_expression(string):
    string = StringIO(string)
    result = current = []
    stack = []
    for token in tokenize.generate_tokens(string.readline):
        if (kind := token.type) == tokens.OP:
            if token.exact_type == tokens.LPAR:
                # Open a group: remember the parent list, then
                # descend into a new child list.
                stack.append(current)
                current.append(current := [])
            elif token.exact_type == tokens.RPAR:
                # Close the group: climb back up to the parent list.
                current = stack.pop()
            else:
                raise ParseError('invalid operator', token)
        elif kind == tokens.NAME:
            if (name := token.string.upper()) in NAMES:
                current.append(NAMES[name])
            else:
                raise ParseError('invalid name', token)
        elif kind == tokens.STRING or kind == tokens.NUMBER:
            # String tokens arrive quoted, so eval yields the value.
            current.append(eval(token.string))
        elif kind not in OTHER:
            raise ParseError('invalid token', token)
    return result

str1 = "('a' AND ('b' OR 'c')) OR ('d' AND ('e' OR ('g' AND 'h')))"
str2 = "('(AND)' AND (42 OR True)) or ('\\'d\\'' AND ('e' OR ('g' AND 'h')))"
print(parse_expression(str1))
print(parse_expression(str2))
Output:
[['a', AND, ['b', OR, 'c']], OR, ['d', AND, ['e', OR, ['g', AND, 'h']]]]
[['(AND)', AND, [42, OR, True]], OR, ["'d'", AND, ['e', OR, ['g', AND, 'h']]]]
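To get the nested mapping shown in the question, the list-of-lists can be post-processed. Here's a minimal sketch of one way to do that - the to_tree helper is hypothetical (not part of the parser above), and it assumes each parenthesised group mixes only one operator, as in the sample inputs:

def to_tree(node):
    # Scalars (strings, numbers, booleans) pass through unchanged.
    if not isinstance(node, list):
        return node
    # Operators occupy the odd positions, operands the even positions.
    operators = set(node[1::2])
    if len(operators) > 1:
        raise ValueError(f'mixed operators in group: {node}')
    operands = [to_tree(item) for item in node[::2]]
    if not operators:
        # A redundant grouping like "('a')" collapses to its contents.
        return operands[0] if len(operands) == 1 else operands
    # The AND/OR singletons repr as 'AND'/'OR', so repr() gives the key.
    return {repr(operators.pop()): operands}

import json
print(json.dumps(to_tree(parse_expression(str1)), indent=2))

For str1 this prints {"OR": [{"AND": ["a", {"OR": ["b", "c"]}]}, {"AND": ["d", {"OR": ["e", {"AND": ["g", "h"]}]}]}]} as indented JSON, and feeding the same structure to yaml.dump (if PyYAML is installed) produces the YAML layout from the question.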