javascript computer-science tokenize lexer shunting-yard

Is there a simple way I can tokenize a string without a full-blown lexer?

I'm looking to implement the Shunting-yard Algorithm, but I need some help figuring out what the best way to split up a string into its tokens is.

If you notice, the first step of the algorithm is "read a token." This isn't exactly a non-trivial thing to do. Tokens can consist of numbers, operators and parens.

If you are doing something like:

(5+1)

A simple string.split() will give me an array of the tokens { "(", "5", "+", "1", ")" }.

However, it becomes more complicated if you have numbers with multiple digits such as:

((2048*124) + 42)

Now a naive string.split() won't do the trick. The multi-digit numbers are a problem.

I know I could write a lexer, but is there a way to do this without writing a full-blown lexer?

I'm implementing this in JavaScript and I'd like to avoid having to go down the lexer-path if possible. I'll be using the "*", "+", "-" and "/" operators, along with integers.

Solution

How about regular expressions? You could easily write regex to split it the way you want, and the JS string.split method accepts regex as the parameter too.

For example... (modify to include all chars you need etc)

/([0-9]+|[*+-\/()])/