
Split code into tokens (Lexing) in Python


I am attempting to tokenize the following code:

foo ::= 5
bar ::= 15
foobar ::= 20

so the output is:

['foo', '::=', '5', '\n', 'bar', '::=', '15', '\n', 'foobar', '::=', '20']

My current attempt is the following:

import re

# Splits on single spaces only, so runs of spaces yield empty strings
# and newlines stay attached to the preceding token
reTokens = re.compile(r' ')
tokens = reTokens.split(source)
print(tokens)

However, this prints:

['\n', '', '', '', 'foo', '::=', '5\n', '', '', '', 'bar', '::=', '15\n', '', '', '', 'foobar', '::=', '20\n']

As you can see, there are a lot of issues. A couple of major problems are:

  1. Spaces are not being removed completely (empty strings remain in the output).
  2. Certain tokens are not being split properly (e.g. "5\n"); adding \n to the regular expression does not solve the issue either, as re.split then removes the newline from the array completely (see the sketch below).
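For reference, here is a minimal sketch of the behavior (the sample string is illustrative): re.split discards any delimiter it matches unless the delimiter is in a capturing group, which is why the newline disappears, and why keeping it requires capturing it:

import re

sample = "foo ::= 5\nbar ::= 15"

# Without a capturing group, re.split discards every delimiter it matches,
# so the newline vanishes from the result:
print(re.split(r'[ \n]+', sample))
# ['foo', '::=', '5', 'bar', '::=', '15']

# With a capturing group, re.split keeps the captured delimiter, so the
# newline survives (the spaces then need filtering out):
parts = re.split(r'( |\n)', sample)
print([p for p in parts if p and p != ' '])
# ['foo', '::=', '5', '\n', 'bar', '::=', '15']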

Solution

  • You could do:

    from functools import reduce
    
    # Split the source into lines, then tokenize each line on whitespace
    lines = source.splitlines()
    tokens_list = [line.strip().split() for line in lines]
    
    # Join the per-line token lists into one flat list, inserting a '\n'
    # token between the tokens of consecutive lines
    tokens = reduce(lambda x, y: x + ['\n'] + y, tokens_list)
    print(tokens)
    

    This divides the source into its lines, tokenizes each line on whitespace, and then joins the per-line token lists into a single list with a '\n' token between each line's tokens.
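
  • Alternatively, here is a one-pass sketch using re.findall (assuming, as in the example above, that every token is a run of non-whitespace characters). Because findall returns the matches themselves rather than splitting on them, the newlines survive as tokens:

    import re
    
    # Match either a newline or a run of non-whitespace characters;
    # spaces match neither alternative and are simply skipped
    tokens = re.findall(r'\n|\S+', source.strip())
    print(tokens)
    # ['foo', '::=', '5', '\n', 'bar', '::=', '15', '\n', 'foobar', '::=', '20']

    The source.strip() call is only there so a leading or trailing newline in the source does not produce a stray '\n' token at either end.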