Python PLY Yacc - Parsing division of complex numbers

I'm implementing a calculator in Python to be able to do some maths on real numbers but also complex numbers.

I have a lexer/parser using PLY and I'm creating my own class for complex numbers, voluntarily ignoring the fact that Python already have a built-in type for complex numbers.

The parser works fine except for this case :

42i/2i

My parser is interpreting this case like this :

42i/2i = (42*i)/(2*i) = 21

Like you can see above, each complex number is seen like a block, with a real part inseparable from the imaginary part. But the mathematical truth is different. As you might know, this case should be treated as followed :

42i/2i = 42*i/2*i = -21

I should adapt my parser rules to get the correct result, but I cannot figure how. Here's a minimal reproducible example :

import ply.lex as lex
import ply.yacc as yacc

tokens = (
    'NUMBER',
    'DIVIDE',
    'IMAGINE',
)

########################## LEXER ##########################

t_DIVIDE    = r'\/'

def t_NUMBER(t):
    r'\d+(\.\d+)?'
    t.value = int(t.value)
    return t

def t_IMAGINE(t):
    r'i'
    t.value = Complex(0, 1)
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

########################## PARSER #########################

def p_expression_binop(t):
    '''expression : expression DIVIDE expression'''
    t[0] = t[1] / t[3]
    print(t[0])

def p_expression_number(t):
    '''expression : NUMBER
                  | IMAGINE'''
    t[0] = t[1]

def p_expression_imaginary(t):
    '''expression : NUMBER IMAGINE'''
    t[0] = t[1] * t[2]

def p_error(t):
    print("Syntax error!")

###################### COMPLEX CLASS ######################

class Complex(object):
    def __init__(self, real, imag=0):
        self.real = real
        self.imag = imag

    def __str__(self):
        string = ''
        if self.real != 0:
            if self.real % 1 == 0 : self.real = int(self.real)
            string += str(self.real)
        if self.imag != 0:
            if self.imag % 1 == 0 : self.imag = int(self.imag)
            if self.real != 0:
                string += ' + ' if self.imag > 0 else ' - '
            else:
                string += '' if self.imag > 0 else '-'
            if abs(self.imag) != 1:
                string += str(abs(self.imag)) + 'i'
            else:
                string += 'i'
        return string or '0'

    def __mul__(self, other):
        return Complex(self.real * other.real - self.imag * other.imag,
                       self.imag * other.real + self.real * other.imag)

    def __rmul__(self, other):
        return self.__mul__(other)

    def __truediv__(self, other):
        if isinstance(other, (float,int)):
            other = Complex(other)
        s1, s2, o1, o2 = self.real, self.imag, other.real, other.imag
        r = float(o1 ** 2 + o2 ** 2)
        return Complex((s1 * o1 + s2 * o2) / r, ( s2 * o1 - s1 * o2) / r)

    def __rtruediv__(self, other):
        if isinstance(other, (float,int)):
            other = Complex(other)
        return other.__truediv__(-self)

########################## MAIN ##########################

lexer = lex.lex() 
while True:
    try:
        s = raw_input('> ')
    except:
        s = input('> ')
    if s:
        parser = yacc.yacc()
        parser.parse(s)

Any help ? Thanks a lot !

Solution

There's one thing missing from your program:

precedence = (
    ('left', 'DIVIDE'),
)

Without that, Ply generates a shift-reduce conflict when it generates the parser, because of the production

expression : expression DIVIDE expression

I mention that only because the issue at hand is one of operator precedence, although the operator in question is invisible: the implicit multiplication operator embodied in the production:

expression : NUMBER IMAGINE

The second production doesn't cause any parsing conflicts, even without a precedence declaration, but that's because it is not general enough to actually parse useful expressions. Consider, for example, your example

42i/4i

As you say, your parser interprets that as

[expression: [expression: 42 i]
             /
             [expression: 4 i] ]

But you want it to be interpreted as:

[expression: [expression: [expression: 42 i]
                          /
                          [expression: 4]
              i]
]

But it's evident that the outermost expression cannot be generated from your grammar, because your grammar requires that the left-hand operator of implicit multiplication be a NUMBER, and in this parse it is clearly an expression.

Before we just add the production

expression : expression IMAGINE

we should probably try to imagine all the possible use cases. And one of them should spring to mind immediately: 4i² (leaving out the detail of which operator you choose to represent exponentiation).

The incorrect "fused number-imaginary" grammar will see that as the square of 4i (i.e. -16), but the generally-accepted parse is four times the square of i (i.e. -4). That would correspond to the parse

[expression: [expression: 4]
             [expression: [expression: i]
                          POWER
                          [expression: 2]]]

in which the right-hand argument of implicit multiplication isn't an imaginary, but an expression.

So your first task is to decide how general implicit multiplication is going to be in your language. Are all of the following valid? (Some require that your lexer discard whitespace)

4i
i4
(4*5)i
4(7+i)
(4+i)(7-i)
i i
4 i
4 7

You might be suspicious of a couple of the last ones, but I venture to guess that most of these won't cause any surprise. We can later look at how to reject the two-consecutive-number case, if desired, but it seems like the most reasonable implicit multiplication rule is at least similar to the most general:

expression : expression expression

and furthermore, that it has exactly the same precedence and associativity as division or explicit multiplication.

But that leaves us with a problem: How can you put an operator in the precedence table when there is no operator? And the short answer is, you can't. There are some tricks which more or less work, but far and away the simplest and most readable alternative is to write the grammar so that the precedence is explicit.

I won't go into this style more because it's been beaten to death elsewhere and you'll have no problem finding examples all over the internet (most of which use non-terminals called term and factor, which I'm not using here because I think their meanings are sufficiently obscure that many people get them backwards). I'll just write out the grammar, PLY-style, with action functions.

And one note about the action functions: Ply lets you combine two productions with the same action. That's convenient but it doesn't mean that you should do things like this:

def p_additive(p):
    """ additive : additive '+' multiplicative
        additive : additive '-' multiplicative
        multiplicative : multiplicative '*' value
        multiplicative : multiplicative '/' value
        ...
    """
    if p[2] == '+' then:
        p[0] = p[1] + p[3]
    else if p[2] == '-' then:
        p[0] = p[1] - p[3]}
    else if p[2] == '*' then:
        p[0] = p[1] * p[3]}
    else if p[2] == '/' then:
        p[0] = p[1] / p[3]}

To see how silly that is, consider the process by which the parser got there. First, the parser figured out exactly which production was matched. Suppose that production was expression : additive '-' multiplicative. Then it looks up the correspondence from productions to functions, and finds the above function (which is exactly the same function that it would have found had the production involved any other binary operator). So it calls that function, at which point the information about which production matched has effectively been lost.

That means that the first thing the function has to do is figure out which production was reduced, something which was already figured out. And it proceeds to do that by checking the operators one at a time against the ones it knows about, which is a complete waste of cycles.

I hope that's sufficient to explain why the following grammar does not use functions like that. If you're writing a Ply parser action function and you find yourself using an if statement to check which of the productions was matched, you should immediately think about using two (or more) different action functions instead. (If the actions have common features, consider factoring those out into a subfunction. It's likely that the result is more readable and more maintainable.)

Note: I use Python's complex numbers here. Whatever the reason you didn't, it has zero impact on the parsing exercise, other than to add a bunch of code to the parser.

import ply.lex as lex
import ply.yacc as yacc

### Lexer

# This lexer uses literal character tokens wherever possible. They're
# easier, faster, and more readable.
# (https://www.dabeaz.com/ply/ply.html#ply_nn11)

literals = '()+-*/^i'
t_ignore = ' \t'
tokens = ( 'NUMBER', )

def t_NUMBER(t):
    "(?:\d+(?:\.\d*)?|\.\d+)(?:[Ee][+-]?\d+)?"
    t.value = float(t.value)
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

### Parser

# Use this function for any unit production which just passes
# its value through.
def p_unit(p): 
    """ expression : additive
        additive : multiplicative
        multiplicative : exponential
        exponential : value
        value : NUMBER
    """
    p[0] = p[1]

def p_plus(p):
    """ additive : additive '+' multiplicative """
    p[0] = p[1] + p[3]

def p_minus(p):
    """ additive : additive '-' multiplicative """
    p[0] = p[1] - p[3]

# Compare this production with the next one.
def p_implicit_times(p):
    """ multiplicative : multiplicative exponential """
    p[0] = p[1] * p[2]


def p_times(p):
    """ multiplicative : multiplicative '*' exponential """
    p[0] = p[1] * p[3]

# TODO: catch errors
def p_divide(p):
    """ multiplicative : multiplicative '/' exponential """
    p[0] = p[1] / p[3]

# This one is right-associative.
# TODO: catch errors
def p_power(p):
    """ exponential : value '^' exponential """
    p[0] = p[1] ** p[3]

def p_i(p):
    """ value : 'i' """
    p[0] = 1J   # Python's way of writing 0+1i

def p_paren(p):
    """ value : '(' expression ')' """
    p[0] = p[2]

def p_error(t):
    print("Syntax error at " + str(t))

### Very simple driver

import sys
lexer = lex.lex()
parser = yacc.yacc()
while True:
    try:
        print(parser.parse(input('> ')))
    except EOFError:
        break
    except:
        print(sys.exc_info()[1])

It might be useful to look at the grammar, which I extracted from parser.out:

Rule 1     expression -> additive
Rule 2     additive -> multiplicative
Rule 3     multiplicative -> exponential
Rule 4     exponential -> value
Rule 5     value -> NUMBER
Rule 6     additive -> additive + multiplicative
Rule 7     additive -> additive - multiplicative
Rule 8     multiplicative -> multiplicative exponential
Rule 9     multiplicative -> multiplicative * exponential
Rule 10    multiplicative -> multiplicative / exponential
Rule 11    exponential -> value ^ exponential
Rule 12    value -> i
Rule 13    value -> ( expression )

OK, now let's consider the one important case where we do need to modify the grammar for implicit multiplication in order to avoid a conflict. The problem is the innocuous-looking expression 4 - 2. Now, suppose that our grammar implemented unary minus (which the above grammar does not). In that case, there would be two ways of parsing the expression: as a subtraction, and as the implicit product of two subexpressions, 4 and -2.

Obviously, the second interpretation is wrong, and that means that we cannot allow a unary expression to be the second argument of the implicit multiplication production.

Fortunately, we've already abandoned the idea of trying to use precedence declarations to disambiguate our grammar. If we were trying to figure out how to modify

expression : expression expression

so that the second expression couldn't start with a unary minus operator, we'd be in big trouble.

As another stroke of luck, in standard algebraic notation the exponentiation operator is right-associative and also binds more tightly on the left than unary minus, so that -2⁴ evaluates to -16 (-(2⁴)) and not 16 ((-2)⁴).

When we put that together, the modification to allow unary minus turns out to be almost trivial. We insert unary minus in the precedence hierarchy where it logically belongs, at the same level as exponentiation. [See note 1] But we need to make one exception, so that the implicit multiplication production cannot have a unary minus expression as its right-hand argument. To do that, we need to split the level into two pieces, like this:

Rule 1     expression -> additive
Rule 2     additive -> multiplicative
Rule 3     multiplicative -> unary
Rule 4     unary -> exponential
Rule 5     exponential -> value
Rule 6     value -> NUMBER
Rule 7     additive -> additive + multiplicative
Rule 8     additive -> additive - multiplicative
Rule 9     multiplicative -> multiplicative exponential
Rule 10    multiplicative -> multiplicative * unary
Rule 11    multiplicative -> multiplicative / unary
Rule 12    unary -> - unary
Rule 13    exponential -> value ^ unary
Rule 14    value -> i
Rule 15    value -> ( expression )

Notes

You'll find many grammars which insist on separate precedence levels but if you think about it a bit, you'll see that since exponentiation is right-associative it can share a precedence level with unary minus. If that's not clear, it's probably additional evidence for the fact that precedence declarations tend to create confusion except in very simple uses.