Search code examples
javascriptregexparsinglexical-analysismathematical-expressions

How to implement Lexical Analysis in Javascript


Hey folks, thanks for reading

I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.

I began slowly with the basics : + - / * and parenthesis handling.

I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.

This kind of work is easily applicable with languages such as Lex and Yacc, except I am developping a Javascript-only application.

I tried to transcript the idea into Javascript but I can't figure out how to handle everything in a clean and beautiful way, especially nested parenthesis.


Analysis

Let's define what a calculator query is:

// NON TERMINAL EXPRESSIONS //
query     -> statement
query     -> ε // means end of query

statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number

number    -> integer
number    -> float

// TERMINAL EXPRESSIONS //
operator  -> [+*/%^-]

prefix    -> -

integer   -> [0-9]+

float     -> [0-9]+[.,][0-9]+

Javascript

Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:

(I added spaces to make it more readable)

var calcPat = 
/^ (\s*
    ( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;

If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part

I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:

I created a method called isStatement(string) that's supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.
Process is the following:

-If the first two tokens are a number followed by an operator:

-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)

-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.


What's the problem ?

My problem is that I cannot find matching parenthesis when there is nested structures. How can I do that ? Also, as you can see, this is not a particurlarly generic and clean grammar-checking algorithm. Do you have any idea to improve this pattern ?

Thank you so much for having taken the time to read everything. Gael

(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)


Solution

  • You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.

    • The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.

    • The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.

    You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.

    Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.

    edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.