
Code Golf: Mathematical expression evaluator (that respects PEMDAS)


I challenge you to write a mathematical expression evaluator that respects PEMDAS (order of operations: parentheses, exponentiation, multiplication, division, addition, subtraction) without using regular expressions, a pre-existing "Eval()"-like function, a parsing library, etc.

I saw one pre-existing evaluator challenge on SO (here), but that one specifically required left-to-right evaluation.

Sample inputs and outputs:

"-1^(-3*4/-6)" -> "1"

"-2^(2^(4-1))" -> "256"

"2*6/4^2*4/3" -> "1"

I wrote an evaluator in C#, but would like to see how badly it compares to those of smarter programmers in their languages of choice.

Related:

Code Golf: Evaluating mathematical expressions

Clarifications:

  1. Let's make this a function that accepts a string argument and returns a string result.

  2. As for why no regexes, well, that's to level the playing field. I think there ought to be a separate challenge for "the most compact regex".

  3. Using StrToFloat() is acceptable. By "parsing library" I meant to exclude such things as general-purpose grammar parsers, also to level the playing-field.

  4. Support floats.

  5. Support parentheses, exponentiation, and the four arithmetic operators.

  6. Give multiplication and division equal precedence.

  7. Give addition and subtraction equal precedence.

  8. For simplicity, you may assume all inputs are well-formed.

  9. I don't have a preference as to whether your function accepts such things as ".1" or "1e3" as valid numbers, but accepting them would earn you brownie points. ;)

  10. For divide-by-zero cases, you could perhaps return "NaN" (assuming you wish to implement error handling).


Solution

  • C (465 characters)

    #define F for(i=0;P-8;i+=2)
    #define V t[i
    #define P V+1]
    #define S V+2]),K(&L,4),i-=2)
    #define L V-2]
    K(double*t,int i){for(*++t=4;*t-8;*++t=V])*++t=V];}M(double*t){int i,p,b;
    F if(!P)for(p=1,b=i;i+=2,p;)P?P-1||--p||(P=8,M(t+b+2),K(t+b,i-b),i=b):++p;
    F P-6||(L=pow(L,S;F P-2&&P-7||(L*=(P-7?V+2]:1/S;F P-4&&(L+=(P-5?V+2]:-S;
    F L=V];}E(char*s,char*r){double t[99];char*e,i=2,z=0;for(;*s;i+=2)V]=
    strtod(s,&e),P=z=e-s&&z-4&&z-1?s=e,4:*s++&7;P=8;M(t+2);sprintf(r,"%g",*t);}
    

    The first five newlines are required, the rest are there just for readability. I've counted the first five newlines as one character each. If you want to measure it in lines, it was 28 lines before I removed all the whitespace, but that's a pretty meaningless number. It could have been anything from 6 lines to a million, depending on how I formatted it.

    The entry point is E() (for "evaluate"). The first parameter is the input string, and the second parameter points to the output string, and must be allocated by the caller (as per usual C standards). It can handle up to 47 tokens, where a token is either an operator (one of "+-*/^()"), or a floating point number. Unary sign operators do not count as a separate token.

    This code is loosely based on a project I did many years ago as an exercise. I took out all the error handling and whitespace skipping and retooled it using golf techniques. Below are the 28 lines, with enough formatting that I was able to write it, but probably not enough to read it. You'll want to #include <stdlib.h>, <stdio.h>, and <math.h> (or see note at the bottom).

    See after the code for an explanation of how it works.

    #define F for(i=0;P-8;i+=2)
    #define V t[i
    #define P V+1]
    #define S V+2]),K(&L,4),i-=2)
    #define L V-2]
    K(double*t,int i){
        for(*++t=4;*t-8;*++t=V])
            *++t=V];
    }
    M(double*t){
        int i,p,b;
        F if(!P)
            for(p=1,b=i;i+=2,p;)
                P?P-1||--p||(P=8,M(t+b+2),K(t+b,i-b),i=b):++p;
        F P-6||(L=pow(L,S;
        F P-2&&P-7||(L*=(P-7?V+2]:1/S;
        F P-4&&(L+=(P-5?V+2]:-S;
        F L=V];
    }
    E(char*s,char*r){
        double t[99];
        char*e,i=2,z=0;
        for(;*s;i+=2)
            V]=strtod(s,&e),P=z=e-s&&z-4&&z-1?s=e,4:*s++&7;
        P=8;
        M(t+2);
        sprintf(r,"%g",*t);
    }
    

    The first step is to tokenize. The array of doubles contains two values for each token, an operator (P, because O looks too much like zero), and a value (V). This tokenizing is what is done in the for loop in E(). It also deals with any unary + and - operators, incorporating them into the constant.

    The "operator" field of the token array can have one of the following values:

    0: (
    1: )
    2: *
    3: +
    4: a floating-point constant value
    5: -
    6: ^
    7: /
    8: end of token string

    This scheme was largely derived by Daniel Martin, who noticed that the last 3 bits were unique in the ASCII representation of each of the operators in this challenge.
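
    To see why masking with 7 is enough, here is a minimal sketch. The helper name op_code is hypothetical and does not appear in the golfed code, which simply writes *s++&7 inline.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not in the original answer): map an operator
 * character to its token code by keeping only the low three bits of
 * its ASCII value. */
int op_code(char c){
    /* only the seven operator characters of the challenge are meaningful */
    assert(c != '\0' && strchr("()*+-/^", c) != NULL);
    return c & 7;
}
```

    The seven operators land on 0, 1, 2, 3, 5, 6 and 7, which leaves 4 free for constants; 8 does not fit in three bits, so it can never collide with an operator and serves as the end marker.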

    An uncompressed version of E() would look something like this:

    void Evaluate(char *expression, char *result){
        double tokenList[99];
        char *parseEnd;
        int i = 2, prevOperator = 0;
        /* i must start at 2, because the EvalTokens will write before the
         * beginning of the array.  This is to allow overwriting an opening
         * parenthesis with the value of the subexpression. */
        for(; *expression != 0; i += 2){
            /* try to parse a constant floating-point value */
            tokenList[i] = strtod(expression, &parseEnd);
    
            /* explanation below code */
            if(parseEnd != expression && prevOperator != 4/*constant*/ &&
               prevOperator != 1/*close paren*/){
                expression = parseEnd;
                prevOperator = tokenList[i + 1] = 4/*constant*/;
            }else{
                /* it's an operator */
                prevOperator = tokenList[i + 1] = *expression & 7;
                expression++;
            }
        }
    
        /* done parsing, add end-of-token-string operator */
        tokenList[i + 1] = 8/*end*/;
    
        /* Evaluate the expression in the token list */
        EvalTokens(tokenList + 2); /* remember the offset by 2 above? */
    
        sprintf(result, "%g", tokenList[0]/* result ends up in first value */);
    }
    

    Since we're guaranteed valid input, the only reason parsing would fail is that the next token is an operator. If this happens, the parseEnd pointer will be the same as expression. We must also handle the case where parsing succeeded, but what we really wanted was an operator. This occurs for the addition and subtraction operators whenever they follow a value rather than another operator. In other words, given the expression "4-6", we want to parse it as {4, -, 6}, and not as {4, -6}. On the other hand, given "4+-6", we should parse it as {4, +, -6}. The solution is quite simple: if parsing fails OR the preceding token was a constant or a closing parenthesis (effectively a subexpression which will evaluate to a constant), then the current token is an operator; otherwise it's a constant.
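
    The rule can be tried out on its own. The following is a sketch under a few assumptions: it stores values and codes in two separate arrays instead of one interleaved array of doubles, and the name tokenize and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: split a well-formed expression into tokens.
 * codes[] uses the table above (4 = constant, otherwise c & 7);
 * values[] is only meaningful where codes[] is 4.
 * Returns the number of tokens written (at most max). */
int tokenize(const char *s, double *values, int *codes, int max){
    int n = 0, prev = 0;   /* prev = 0 acts like "after an opening paren" */
    char *end;
    assert(s != NULL && values != NULL && codes != NULL);
    while(*s != '\0' && n < max){
        double v = strtod(s, &end);
        if(end != s && prev != 4 && prev != 1){
            /* a number was parsed and a value is expected here */
            s = end;
            values[n] = v;
            prev = 4;
        }else{
            /* parse failed, or a value just ended: it's an operator */
            values[n] = 0;
            prev = *s++ & 7;
        }
        codes[n++] = prev;
    }
    return n;
}
```

    With this sketch, "4-6" yields three tokens with codes 4, 5, 4, while "4+-6" yields codes 4, 3, 4 with the last value being -6.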

    After tokenizing is done, calculating and folding are done by calling M(), which first looks for any matched pairs of parentheses and processes the subexpressions contained within by calling itself recursively. Then it processes operators, first exponentiation, then multiplication and division together, and finally addition and subtraction together. Because well-formed input is expected (as specified in the challenge), it doesn't check for the addition operator explicitly, since it's the last legal operator after all the others are processed.
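
    As a sanity check on that order, the three sample inputs from the challenge can be written out by hand, one step per pass. This is only a sketch: the function names are hypothetical, and ipow is a stand-in for pow() restricted to non-negative integer exponents so that the example does not need the math library. Note that a leading sign is part of the constant, so "-1^2" is computed as (-1)^2.

```c
#include <assert.h>

/* Stand-in for pow(), limited to non-negative integer exponents. */
static double ipow(double base, int n){
    double r = 1.0;
    assert(n >= 0);
    while(n-- > 0)
        r *= base;
    return r;
}

/* "-1^(-3*4/-6)": parentheses first, then ^ on the constant -1 */
double sample1(void){
    double inner = -3.0 * 4.0 / -6.0;   /* * and / left to right: 2 */
    return ipow(-1.0, (int)inner);      /* (-1)^2 */
}

/* "-2^(2^(4-1))": innermost parentheses first */
double sample2(void){
    double innermost = 4.0 - 1.0;               /* 3 */
    double inner = ipow(2.0, (int)innermost);   /* 8 */
    return ipow(-2.0, (int)inner);              /* (-2)^8 */
}

/* "2*6/4^2*4/3": ^ first, then * and / left to right */
double sample3(void){
    double p = ipow(4.0, 2);            /* 16 */
    return 2.0 * 6.0 / p * 4.0 / 3.0;   /* 12, 0.75, 3, 1 */
}
```

    The three functions return 1, 256 and 1, matching the expected outputs listed in the question.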

    The calculation function, lacking golf compression, would look something like this:

    void EvalTokens(double *tokenList){
        int i, parenLevel, parenStart;
    
        for(i = 0; tokenList[i + 1] != 8/*end*/; i+= 2)
            if(tokenList[i + 1] == 0/*open paren*/)
                for(parenLevel = 1, parenStart = i; i += 2, parenLevel > 0;){
                    if(tokenList[i + 1] == 0/*another open paren*/)
                        parenLevel++;
                    else if(tokenList[i + 1] == 1/*close paren*/)
                        if(--parenLevel == 0){
                            /* make this a temporary end of list */
                            tokenList[i + 1] = 8;
                            /* recursively handle the subexpression */
                            EvalTokens(tokenList + parenStart + 2);
                            /* fold the subexpression out */
                            FoldTokens(tokenList + parenStart, i - parenStart);
                            /* bring i back to where the folded value of the
                             * subexpression is now */
                            i = parenStart;
                        }
                }
    
        for(i = 0; tokenList[i + 1] != 8/*end*/; i+= 2)
            if(tokenList[i + 1] == 6/*exponentiation operator (^)*/){
                tokenList[i - 2] = pow(tokenList[i - 2], tokenList[i + 2]);
                FoldTokens(tokenList + i - 2, 4);
                i -= 2;
            }
        for(i = 0; tokenList[i + 1] != 8/*end*/; i+= 2)
            if(tokenList[i + 1] == 2/*multiplication operator (*)*/ ||
               tokenList[i + 1] == 7/*division operator (/)*/){
                tokenList[i - 2] *=
                    (tokenList[i + 1] == 2 ?
                        tokenList[i + 2] :
                        1 / tokenList[i + 2]);
                FoldTokens(tokenList + i - 2, 4);
                i -= 2;
            }
        for(i = 0; tokenList[i + 1] != 8/*end*/; i+= 2)
            if(tokenList[i + 1] != 4/*constant*/){
                tokenList[i - 2] +=
                    (tokenList[i + 1] == 3 ?
                        tokenList[i + 2] :
                        -tokenList[i + 2]);
                FoldTokens(tokenList + i - 2, 4);
                i -= 2;
            }
        tokenList[-2] = tokenList[0];
        /* the compressed code does the above in a loop, equivalent to:
         *
         * for(i = 0; tokenList[i + 1] != 8; i+= 2)
         *     tokenList[i - 2] = tokenList[i];
         *
         * This loop will actually only iterate once, and thanks to the
         * liberal use of macros, is shorter. */
    }
    

    Some amount of compression would probably make this function easier to read.
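
    The parenthesis-matching loop at the top of EvalTokens() boils down to a depth counter. Here is a minimal sketch of just that step, working on the characters of a string rather than on the token array; match_paren is a hypothetical name, not part of the answer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the index of an opening parenthesis in a
 * well-formed expression, return the index of its matching close. */
size_t match_paren(const char *s, size_t open){
    size_t i = open;
    int depth = 1;   /* the parenthesis at s[open] is already counted */
    assert(s != NULL && s[open] == '(');
    while(depth > 0){
        i++;
        assert(s[i] != '\0');   /* input is assumed to be well-formed */
        if(s[i] == '(')
            depth++;
        else if(s[i] == ')')
            depth--;
    }
    return i;
}
```

    For "-2^(2^(4-1))", the parenthesis at index 3 matches the one at index 11, and the nested one at index 6 matches index 10. EvalTokens() does the same thing two array slots at a time, then recurses on the span in between.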

    Once an operation is performed, the operands and operator are folded out of the token list by K() (called through the macro S). The result of the operation is left as a constant in place of the folded expression. Consequently, the final result is left at the beginning of the token array, so when control returns to E(), it simply prints that to a string, taking advantage of the fact that the first value in the array is the value field of the token.
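
    The "%g" conversion is what turns a double such as 256.0 into the bare "256" the challenge expects. A small sketch of that last step follows; format_result is a hypothetical name, and it uses snprintf with an explicit buffer size where the golfed code uses sprintf.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format a result the way E() does, but with a
 * bounded write into the caller's buffer. */
void format_result(double value, char *out, size_t size){
    assert(out != NULL && size > 0);
    snprintf(out, size, "%g", value);
}
```

    With this, 256.0 comes out as "256" and 0.75 as "0.75". Keep in mind that "%g" prints six significant digits by default, so large or very precise results are rounded or switched to exponent notation.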

    This call to FoldTokens() takes place either after an operation (^, *, /, +, or -) has been performed, or after a subexpression (surrounded by parentheses) has been processed. The FoldTokens() routine ensures that the result value has the correct operator type (4), and then copies the rest of the larger expression over the remainder of the subexpression. For instance, when the expression "2+6*4+1" is processed, EvalTokens() first calculates 6*4, leaving the result in place of the 6 (2+24*4+1). FoldTokens() then removes the rest of the subexpression ("*4"), leaving 2+24+1.

    void FoldTokens(double *tokenList, int offset){
        tokenList++;
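        tokenList[0] = 4; /* force the surviving token to be a constant */
    
        /* copy each following (value, operator) pair down by offset; the
         * check sees the operator just copied, so the loop stops as soon
         * as the end marker has been moved into place */
        while(tokenList[0] != 8/*end of token string*/){
            tokenList[1] = tokenList[offset + 1];
            tokenList[2] = tokenList[offset + 2];
            tokenList += 2;
        }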
    }
    

    That's it. The macros are just there to replace common operations, and everything else is just golf-compression of the above.


    strager insists that the code should include #include statements, as it will not function correctly without proper forward declarations of the strtod and pow functions. Since the challenge asks for just a function, and not a complete program, I hold that this should not be required. However, forward declarations could be added at minimal cost by adding the following code:

    #define D double
    D strtod(),pow();
    

    I would then replace all instances of "double" in the code with "D". This would add 19 characters to the code, bringing the total up to 484. On the other hand, I could also convert my function to return a double instead of a string, as did he, which would trim 15 characters, changing the E() function to this:

    D E(char*s){
        D t[99];
        char*e,i=2,z=0;
        for(;*s;i+=2)
            V]=strtod(s,&e),P=z=e-s&&z-4&&z-1?s=e,4:*s++&7;
        P=8;
        M(t+2);
        return*t;
    }
    

    This would make the total code size 469 characters (or 452 without the forward declarations of strtod and pow, but with the D macro). It would even be possible to trim 1 more character by requiring the caller to pass in a pointer to a double for the return value:

    E(char*s,D*r){
        D t[99];
        char*e,i=2,z=0;
        for(;*s;i+=2)
            V]=strtod(s,&e),P=z=e-s&&z-4&&z-1?s=e,4:*s++&7;
        P=8;
        M(t+2);
        *r=*t;
    }
    

    I'll leave it to the reader to decide which version is appropriate.