
Split code into tokens (Lexing) in Python


I am attempting to tokenize the following code:

foo ::= 5
bar ::= 15
foobar ::= 20

so the output is:

['foo', '::=', '5', '\n', 'bar', '::=', '15', '\n', 'foobar', '::=', '20']

My current attempt is the following:

import re

# Splits on single spaces only, so runs of spaces yield empty strings
# and newlines stay attached to the preceding token
reTokens = re.compile(r' ')
tokens = reTokens.split(source)
print(tokens)

However, this prints:

['\n', '', '', '', 'foo', '::=', '5\n', '', '', '', 'bar', '::=', '15\n', '', '', '', 'foobar', '::=', '20\n']

As you can see, there are a lot of issues. A couple of major problems are:

  1. Spaces are not being removed completely (empty strings remain in the output).
  2. Certain tokens are not being split properly (e.g. "5\n"); adding \n to the regular expression does not solve the issue either, as re.split then removes the newline from the array completely (see the sketch below).
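For reference, here is a minimal sketch of the behavior (the sample string is illustrative): re.split discards any delimiter it matches unless the delimiter is in a capturing group, which is why the newline disappears, and why keeping it requires capturing it:

import re

sample = "foo ::= 5\nbar ::= 15"

# Without a capturing group, re.split discards every delimiter it matches,
# so the newline vanishes from the result:
print(re.split(r'[ \n]+', sample))
# ['foo', '::=', '5', 'bar', '::=', '15']

# With a capturing group, re.split keeps the captured delimiter, so the
# newline survives (the spaces then need filtering out):
parts = re.split(r'( |\n)', sample)
print([p for p in parts if p and p != ' '])
# ['foo', '::=', '5', '\n', 'bar', '::=', '15']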

Solution

  • You could do:

    from functools import reduce
    
    # Split the source into lines, then tokenize each line on whitespace
    lines = source.splitlines()
    tokens_list = [line.strip().split() for line in lines]
    
    # Join the per-line token lists into one flat list, inserting a '\n'
    # token between the tokens of consecutive lines
    tokens = reduce(lambda x, y: x + ['\n'] + y, tokens_list)
    print(tokens)
    

    This divides the source into its lines, tokenizes each line on whitespace, and then joins the per-line token lists into a single list with a '\n' token between each line's tokens.
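
  • Alternatively, here is a one-pass sketch using re.findall (assuming, as in the example above, that every token is a run of non-whitespace characters). Because findall returns the matches themselves rather than splitting on them, the newlines survive as tokens:

    import re
    
    # Match either a newline or a run of non-whitespace characters;
    # spaces match neither alternative and are simply skipped
    tokens = re.findall(r'\n|\S+', source.strip())
    print(tokens)
    # ['foo', '::=', '5', '\n', 'bar', '::=', '15', '\n', 'foobar', '::=', '20']

    The source.strip() call is only there so a leading or trailing newline in the source does not produce a stray '\n' token at either end.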