
Regex to split a mixed expression into tokens


I am trying to split the following expression into an array of tokens so that I can use the shunting-yard algorithm to convert it to postfix and evaluate it later. Here is part of the string:

    $string = '(fld_1010=="t" or fld_1010 != "test") and fld_1012 >= "18"';

I am using the following pattern

    $pattern = "/([\(|\s]*)(fld_)([0-9]*)[\s]*(!=|==|>=|<=|=|>|<|like|in)(.*?)([\)|\s]*)( and| or|\z)/";
    $found = preg_match_all($pattern, $string, $result, PREG_SET_ORDER);

    print_r($result);

But I got this output:

[
    [
        "(fld_1010==\"t\" or",
        "(",
        "fld_",
        "1010",
        "==",
        "\"t\"",
        "",
        " or"
    ],
    [
        " fld_1010 != \"test\") and",
        " ",
        "fld_",
        "1010",
        "!=",
        " \"test\"",
        ")",
        " and"
    ],
    [
        " fld_1012 >= \"18\"",
        " ",
        "fld_",
        "1012",
        ">=",
        " \"18\"",
        "",
        ""
    ]
]

How can I split the string into a flat list of tokens like this instead?

[
"(",
"fld_1010",
"==",
"t",
"or",
"fld_1010",
"!=",
"test",
")",
"and",
"fld_1012",
">=",
"18"
]

I am following this link, but it only applies to mathematical expressions that contain numbers.

Thank you.


Solution

  • You should tackle this in phases. The first phase is indeed to tokenize the input, but do not use this step to verify that the tokens appear in a valid order. Focus only on the syntax of each individual token, not on the context in which it occurs. So don't check yet whether the parentheses are balanced or whether operators occur between two operands.

    One other thing to change is the last argument you pass to preg_match_all: use PREG_PATTERN_ORDER. That way you get all the matches together in one subarray, and all the potential capture groups will be collected in separate subarrays.
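
    To make the difference between the two flags concrete, here is a minimal sketch. The pattern and input are deliberately simplified stand-ins for illustration, not the pattern from the question:

    ```php
    <?php
    // Simplified demo pattern with a single capture group (the field number).
    $pattern = '/fld_([0-9]+)/';
    $input   = 'fld_1010 or fld_1012';

    // PREG_SET_ORDER: one subarray per match, each holding the full match
    // followed by its capture groups.
    preg_match_all($pattern, $input, $bySet, PREG_SET_ORDER);
    // $bySet is [['fld_1010', '1010'], ['fld_1012', '1012']]

    // PREG_PATTERN_ORDER: one subarray per group; index 0 holds all full matches.
    preg_match_all($pattern, $input, $byPattern, PREG_PATTERN_ORDER);
    // $byPattern is [['fld_1010', 'fld_1012'], ['1010', '1012']]
    ```

    With PREG_PATTERN_ORDER the flat token list is simply the subarray at index 0, which is the shape the question asks for.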

    I would reserve a capture group for catching anything that does not fit any of the patterns. This will then be an indication of a syntax error.

    Here is how you could do that:

    $string = '(fld_1010=="t" or fld_1010 != "test") and fld_1012 >= "18"';
    
    // This pattern does not verify any order; just the valid tokens.
    // The final (\S+) is a "catchall" for errors:
    $pattern = '/[!=<>]=|[<>()]|\b(?:like|in|and|or|fld_[0-9]*)\b|"[^"]*"|(\S+)/';
    // Use PREG_PATTERN_ORDER here
    $found = preg_match_all($pattern , $string , $result, PREG_PATTERN_ORDER);
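    // Extract the second subarray: it holds whatever (\S+) captured
    // (an empty string for every valid token). Compare against '' explicitly,
    // because a bare array_filter() would also discard an invalid token "0":
    $errors = array_filter($result[1], function ($token) { return $token !== ''; });
    if ($errors) {
        echo "The following tokens are invalid:\n";
        print_r($errors);
    }
    $result = $result[0]; // just get the matches
    print_r($result); // The requested tokens; string literals keep their quotes.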
    

    Notice that for string literals I did not do anything to allow for double quotes to be part of them (with some escape character). If you need this, you will need to extend the regular expression to cope with that.
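
    If escaped quotes are needed, one possible extension looks like the sketch below. It assumes that a backslash escapes the next character inside a literal, which is my assumption and not something the question specifies. The helper unquote() is a hypothetical name introduced here; it also strips the surrounding quotes, since the desired output in the question shows the values without them.

    ```php
    <?php
    // Assumption: inside a string literal, a backslash escapes the next character.
    // Only the string-literal alternative differs from the tokenizer pattern:
    // "(?:[^"\\]|\\.)*" accepts any run of non-quote, non-backslash characters
    // or backslash-escaped characters between the quotes.
    $pattern = '/[!=<>]=|[<>()]|\b(?:like|in|and|or|fld_[0-9]*)\b|"(?:[^"\\\\]|\\\\.)*"|(\S+)/';

    $input = 'fld_1010 == "say \"hi\""';
    preg_match_all($pattern, $input, $m, PREG_PATTERN_ORDER);
    $tokens = $m[0];

    // Hypothetical helper: remove the surrounding quotes of a string literal
    // and resolve backslash escapes; every other token is returned unchanged.
    function unquote(string $token): string
    {
        if (strlen($token) < 2 || $token[0] !== '"' || substr($token, -1) !== '"') {
            return $token;
        }
        return preg_replace('/\\\\(.)/s', '$1', substr($token, 1, -1));
    }

    $values = array_map('unquote', $tokens);
    print_r($values);
    ```

    An unterminated literal such as "abc does not match the string alternative, so it still lands in the (\S+) catch-all and is reported as an error.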

    The second phase is to verify that these tokens appear in a valid order. I would not try to do this with regular expressions, but with PHP code. Expressions can become very complex, with many nested parentheses, function calls (like maybe "abs()"), unary operators (like "+" or "not") and binary operators, precedence rules (e.g. multiplication before addition), and associativity rules (e.g. exponentiation being evaluated from right to left).
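
    As a rough illustration of that second phase, the sketch below runs a basic shunting-yard pass over the token list and rejects tokens that appear in an impossible position. The function name toPostfix() and the precedence table are assumptions made for this example (comparisons bind tighter than "and", which binds tighter than "or", all left-associative). It handles only binary operators and parentheses, not functions or unary operators.

    ```php
    <?php
    // Hypothetical helper: convert infix tokens to postfix (shunting-yard),
    // throwing when operands and operators do not alternate properly or
    // when parentheses are unbalanced.
    function toPostfix(array $tokens): array
    {
        // Assumed precedence: higher numbers bind tighter.
        $precedence = [
            'or' => 1, 'and' => 2,
            '==' => 3, '!=' => 3, '>=' => 3, '<=' => 3,
            '<' => 3, '>' => 3, 'like' => 3, 'in' => 3,
        ];
        $output = [];
        $stack = [];
        $expectOperand = true; // true at the start and after "(" or an operator

        foreach ($tokens as $token) {
            if ($token === '(') {
                if (!$expectOperand) {
                    throw new InvalidArgumentException("Unexpected '('");
                }
                $stack[] = $token;
            } elseif ($token === ')') {
                if ($expectOperand) {
                    throw new InvalidArgumentException("Unexpected ')'");
                }
                while ($stack && end($stack) !== '(') {
                    $output[] = array_pop($stack);
                }
                if (!$stack) {
                    throw new InvalidArgumentException("Unbalanced ')'");
                }
                array_pop($stack); // discard the matching "("
            } elseif (isset($precedence[$token])) {
                if ($expectOperand) {
                    throw new InvalidArgumentException("Unexpected operator $token");
                }
                // Left-associative: pop operators of equal or higher precedence.
                while ($stack && end($stack) !== '('
                    && $precedence[end($stack)] >= $precedence[$token]) {
                    $output[] = array_pop($stack);
                }
                $stack[] = $token;
                $expectOperand = true;
            } else {
                if (!$expectOperand) {
                    throw new InvalidArgumentException("Unexpected operand $token");
                }
                $output[] = $token;
                $expectOperand = false;
            }
        }
        if ($expectOperand) {
            throw new InvalidArgumentException('Expression ends unexpectedly');
        }
        while ($stack) {
            $operator = array_pop($stack);
            if ($operator === '(') {
                throw new InvalidArgumentException("Unbalanced '('");
            }
            $output[] = $operator;
        }
        return $output;
    }

    $tokens = ['(', 'fld_1010', '==', '"t"', 'or', 'fld_1010', '!=', '"test"',
               ')', 'and', 'fld_1012', '>=', '"18"'];
    echo implode(' ', toPostfix($tokens)), "\n";
    // fld_1010 "t" == fld_1010 "test" != or fld_1012 "18" >= and
    ```

    Keeping the order checks in plain code like this makes it straightforward to add further rules later without touching the tokenizer pattern.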

    Another implementation

    Just for reference, I want to point to a Shunting-Yard implementation I once did in JavaScript, where all operators and functions are defined dynamically. Maybe that goes too far for your purposes, but it might serve as an inspiration.