python pandas extract eval expr

find variables for pandas.eval() using regex or the same expression parsers as pandas


I've got a large number of variables that can be called in a script and evaluated by pandas.eval(). This is effectively a math evaluator. BUT, we have to assemble the dataframe from multiple data sources before executing eval(). This means I need to parse the equation we're given, execute our search methods and find the named variables, merge them into a dataframe, and finally execute the eval() method.

I tried using Regex to do this, which might still be the answer, but everything gets really tripped up when variables have spaces in the names. I'm a regex novice.

I then went into the pandas.eval() source code here: https://github.com/pandas-dev/pandas/blob/main/pandas/core/computation/expr.py to try to duplicate the methods that pandas is using, since they've got to do the same thing. Turns out they don't have a good answer to the spaces in names thing either (here and here).

I still like the idea of using the same parsing as pandas, since that's the most robust solution. Pandas is using Expr from pandas.core.computation.expr (source). But I'm just getting all kinds of errors when trying to use that method. I can reconstruct the entire sequence of generating the context and pass it the same string that works in pandas.eval() (everything in this line: Expr(simpeval, engine=engine, parser=parser, env=env)) and I get pandas.core.computation.ops.UndefinedVariableError: name 'var1' is not defined. I don't get how pandas.eval() is doing it on line 345 without getting this error.

I think I just need some more brains on this. I'm trying to get a list of all variables (col names with spaces allowed) from a string eval expression.

Ideal outcome:

in
eval_string = "(var1 / var with spaces 2) + var with spaces 3"
parsed_vars = magic_of_stack_overflow(eval_string)
parsed_vars

out
["var1","var with spaces 2","var with spaces 3"]

ideal ideal outcome: I can do the same thing, but use the exact parsers that I'm going to be passing the expressions to in pandas.

I'd be stoked with either of these.

edit: python 3.12 pandas 1.3.5


Solution

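  • You could use a regex to match substrings that start with a letter, contain none of the characters ( ) / + -, and end with a character that is neither one of those nor a space (add other operators such as * to both character classes if your expressions use them):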

    import re
    
    eval_string = "(var1 / var with spaces 2) + var with spaces 3"
    
    parsed_vars = re.findall(r'[a-zA-Z][^()/+-]+[^()/\-+ ]', eval_string)
    

    Output:

    ['var1', 'var with spaces 2', 'var with spaces 3']
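
The pattern above needs every name to be at least three characters long and only knows about ( ) / + -. A slightly more general sketch is to split the expression on operator characters instead of matching the names directly. The function name find_variables and the operator set below are assumptions for illustration, not part of pandas; names that themselves contain an operator character (for example a hyphen) still cannot be recognised this way, and literals in scientific notation such as 1e-3 are not handled.

```python
import re

# Operator and grouping characters to split on. This set is an assumption:
# extend it if your expressions use others (comparisons, commas, ...).
_SPLIT = re.compile(r"[()+\-*/%]+")

# Plain integer or decimal literals such as 2, 0.5 or .25.
_NUMBER = re.compile(r"\d+(?:\.\d*)?|\.\d+")


def find_variables(expression):
    """Return distinct variable names in order of first appearance."""
    names = []
    for chunk in _SPLIT.split(expression):
        name = chunk.strip()
        # Skip empty pieces (between adjacent operators) and numeric literals.
        if not name or _NUMBER.fullmatch(name):
            continue
        if name not in names:
            names.append(name)
    return names


eval_string = "(var1 / var with spaces 2) + var with spaces 3"
parsed_vars = find_variables(eval_string)
print(parsed_vars)
```

Because the split drops numeric literals and repeated names, an expression like "2 * a b + a b ** 0.5 - c" yields only 'a b' and 'c'.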
    

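
If you are able to wrap the names that contain spaces in backticks, which is the quoting that DataFrame.eval() and DataFrame.query() accept for such column names, the names can be collected with Python's ast module without looking anything up, so no UndefinedVariableError can occur. This is only a sketch under that assumption: it does not call pandas' own Expr class, the function name find_variables_backticked and the _quoted_N_ placeholder scheme are made up for illustration, and pandas-specific syntax such as @local_variable is not handled.

```python
import ast
import re

# A backtick-quoted name, e.g. `var with spaces 2`.
_BACKTICKED = re.compile(r"`([^`]+)`")


def find_variables_backticked(expression):
    """Return distinct variable names in order of first appearance."""
    placeholders = {}

    def _stash(match):
        # Swap each quoted name for a valid identifier so ast can parse it.
        # Assumes no real variable is already called _quoted_N_.
        key = f"_quoted_{len(placeholders)}_"
        placeholders[key] = match.group(1)
        return key

    cleaned = _BACKTICKED.sub(_stash, expression)
    tree = ast.parse(cleaned, mode="eval")

    # Names used as the callee of a call (e.g. sqrt in sqrt(x)) are functions,
    # not columns, so leave them out.
    called = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    nodes = [
        n for n in ast.walk(tree)
        if isinstance(n, ast.Name) and id(n) not in called
    ]
    # ast.walk is breadth-first, so sort back into source order.
    nodes.sort(key=lambda n: (n.lineno, n.col_offset))

    names = []
    for node in nodes:
        name = placeholders.get(node.id, node.id)
        if name not in names:
            names.append(name)
    return names


eval_string = "(var1 / `var with spaces 2`) + `var with spaces 3`"
print(find_variables_backticked(eval_string))
```

The returned names are the unquoted column names, so they can be used directly to look up and merge the source data before the backticked expression is passed to DataFrame.eval().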