I'm trying to make a very simple lexical analyzer (tokenizer) for C++ code from scratch, without using PLY or any other library.
Now I'm trying to make a function check_line(line)
which will consume a line of code and return the tokens in a Dictionary. For example:
check_line('int main()')
The output should be:
Tokens = {'Keyword': 'int', 'Keyword': 'main', 'Opening parenthesis': '(', 'Closing parenthesis': ')'}
But the output I'm getting is:
Tokens = {'Keyword': 'main', 'Keyword': 'main', 'Opening parenthesis': '(', 'Closing parenthesis': ')'}
Is there a way to tackle something like this?
When I pass check_line('int main()')
into the function, the program doesn't match main
because the parentheses are attached to it. How can I handle this?
I'm pasting the code I've written so far; please have a look and let me know what you think.
import re

# Keywords
keywords = ['const', 'float', 'int', 'struct', 'break',
            'continue', 'else', 'for', 'switch', 'void',
            'case', 'enum', 'sizeof', 'typedef', 'char',
            'do', 'if', 'return', 'union', 'while', 'new',
            'public', 'class', 'friend', 'main']

# Regular expression for identifiers
re_id = r'^[_]?[a-z]*[A-Z]([a-z]*[A-Z]*[0-9]+)'

# Regular expressions for literals
re_int_lit = r'^[+-]?[0-9]+'
re_float_lit = r'^[+-]?([0-9]*)\.[0-9]+'
re_string_lit = r'^"[a-zA-Z0-9_ ]+"$'

# Regular expressions for comments
re_singleline_comment = r'^//[a-zA-Z0-9 ]*'
re_multiline_comment = r'^/\*(.*?)\*/'

operators = {'=': 'Assignment', '-': 'Subtraction',
             '+': 'Addition', '*': 'Multiplication',
             '/': 'Division', '++': 'Increment',
             '--': 'Decrement', '||': 'OR', '&&': 'AND',
             '<<': 'Cout operator', '>>': 'Cin operator',
             ';': 'End of statement'}

io = {'cin': 'User input',
      'cout': 'User output'}

brackets = {'[': 'Open square', ']': 'Close square',
            '{': 'Open curly', '}': 'Close curly',
            '(': 'Open small', ')': 'Close small'}

# Function
def check_line(line):
    tokens = {}
    words = line.split(' ')
    for word in words:
        if word in operators.keys():
            tokens['Operator ' + word] = word
        if word in keywords:
            tokens['Keywords'] = word
        if re.match(re_singleline_comment, word):
            break
    return tokens
Calling check_line('int main()') currently returns:
{'Keywords': 'main'}
whereas the desired output is:
Tokens = {'Keyword': 'int', 'Keyword': 'main', 'Opening parenthesis': '(', 'Closing parenthesis': ')'}
A dictionary is a really bad choice of data structure for this function, since the essence of a dictionary is that each key is associated with exactly one corresponding value.
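You can see the problem directly: assigning to the same key twice silently discards the first value, which is exactly what happens to 'int' in your output.

```python
# Assigning to the same dictionary key twice keeps only the last value.
tokens = {}
tokens['Keyword'] = 'int'
tokens['Keyword'] = 'main'   # silently overwrites 'int'
print(tokens)                # → {'Keyword': 'main'}
```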
What a tokenizer should return is quite different: an ordered stream of token objects. In a simple implementation, that might be a list of tuples, but for any non-trivial application, you'll soon find that:
Tokens are not just a syntactic type and a string. There's lots of important auxiliary information, most notably the location of the token in the input stream (for error messages).
Tokens are almost always consumed in sequence, and there is no particular advantage in producing more than one at a time. In Python, a generator is a much more natural way of producing a stream of tokens. If it were useful to create a list of tokens (for example, to implement a back-tracking parser), there would be no point working line by line, since line breaks are generally irrelevant in C++.
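To illustrate both points, a token might be a small record with a kind, a value, and a position, produced lazily by a generator. (The `Token` type and its field names here are one possible design, not something from your code; the stream is hand-built purely to show the shape a real lexer would produce.)

```python
from typing import NamedTuple

class Token(NamedTuple):
    kind: str    # syntactic category, e.g. 'Keyword'
    value: str   # the matched text
    pos: int     # offset in the input, for error messages

def demo_tokens():
    # Hand-built stream for 'int main()', for illustration only;
    # a real lexer would generate these by matching patterns.
    yield Token('Keyword', 'int', 0)
    yield Token('Identifier', 'main', 4)
    yield Token('Open paren', '(', 8)
    yield Token('Close paren', ')', 9)

for tok in demo_tokens():
    print(tok.kind, tok.value, tok.pos)
```

Because `demo_tokens` is a generator, a parser can pull tokens one at a time instead of materializing the whole list up front.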
As noted in a comment, C++ tokens are not always separated by whitespace, as is evident in your example input. (main() is three tokens without containing a single space character.) The best way of splitting program text into a token stream is to repeatedly match token patterns at the current input cursor, return the longest match, and move the input cursor over the match.
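A minimal sketch of that approach might look like this. The pattern set is illustrative only (nowhere near a complete C++ lexer), and skipping whitespace inside the loop is one design choice among several:

```python
import re

# Candidate patterns, each anchored at the current cursor by re.match.
TOKEN_PATTERNS = [
    ('Keyword',     r'\b(?:int|float|void|return|if|else|while|for)\b'),
    ('Identifier',  r'[A-Za-z_]\w*'),
    ('Number',      r'\d+(?:\.\d+)?'),
    ('Open paren',  r'\('),
    ('Close paren', r'\)'),
    ('Operator',    r'\+\+|--|&&|\|\||<<|>>|[=+\-*/;]'),
    ('Whitespace',  r'\s+'),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best_kind, best_match = None, None
        # Try every pattern at the cursor and keep the longest match.
        for kind, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:])
            if m and (best_match is None or len(m.group()) > len(best_match.group())):
                best_kind, best_match = kind, m
        if best_match is None:
            raise SyntaxError(f'Unexpected character {text[pos]!r} at position {pos}')
        if best_kind != 'Whitespace':   # skip whitespace, emit everything else
            yield best_kind, best_match.group(), pos
        pos += best_match.end()         # advance the cursor over the match

print(list(tokenize('int main()')))
# → [('Keyword', 'int', 0), ('Identifier', 'main', 4),
#    ('Open paren', '(', 8), ('Close paren', ')', 9)]
```

Note how `main()` now splits into three tokens even though it contains no spaces, because the cursor advances by exactly one match at a time rather than by whitespace-separated words.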