I have to translate lexical analyzer the code in Sebesda's Concpets of Programming Languages (chapter 4, section 2) to python. Here's what I have so far:
# Character classes #
LETTER = 0
DIGIT = 1
UNKNOWN = 99
# Token Codes #
INT_LIT = 10
IDENT = 11
ASSIGN_OP = 20
ADD_OP= 21
SUB_OP = 22
MULT_OP = 23
DIV_OP = 24
LEFT_PAREN = 25
RIGHT_PAREN = 26
charClass = ''
lexeme = ''
lexLen = 0
token = ''
nextToken = ''
### lookup - function to lookup operators and parentheses ###
### and return the token ###
def lookup(ch):
def left_paren():
addChar()
globals()['nextToken'] = LEFT_PAREN
def right_paren():
addChar()
globals()['nextToken'] = RIGHT_PAREN
def add():
addChar()
globals()['nextToken'] = ADD_OP
def subtract():
addChar()
globals()['nextToken'] = SUB_OP
def multiply():
addChar()
globals()['nextToken'] = MULT_OP
def divide():
addChar()
globals()['nextToken'] = DIV_OP
options = {')': right_paren, '(': left_paren, '+': add,
'-': subtract, '*': multiply , '/': divide}
if ch in options.keys():
options[ch]()
else:
addChar()
### addchar- a function to add next char to lexeme ###
def addChar():
#lexeme = globals()['lexeme']
if(len(globals()['lexeme']) <=98):
globals()['lexeme'] += nextChar
else:
print("Error. Lexeme is too long")
### getChar- a function to get the next Character of input and determine its character class ###
def getChar():
globals()['nextChar'] = globals()['contents'][0]
if nextChar.isalpha():
globals()['charClass'] = LETTER
elif nextChar.isdigit():
globals()['charClass'] = DIGIT
else:
globals()['charClass'] = UNKNOWN
globals()['contents'] = globals()['contents'][1:]
## getNonBlank() - function to call getChar() until it returns a non whitespace character ##
def getNonBlank():
while nextChar.isspace():
getChar()
## lex- simple lexical analyzer for arithmetic functions ##
def lex():
globals()['lexLen'] = 0
getNonBlank()
def letterfunc():
globals()['lexeme'] = ''
addChar()
getChar()
while(globals()['charClass'] == LETTER or globals()['charClass'] == DIGIT):
addChar()
getChar()
globals()['nextToken'] = IDENT
def digitfunc():
globals()['lexeme'] = ''
addChar()
getChar()
while(globals()['charClass'] == DIGIT):
addChar()
getChar()
globals()['nextToken'] = INT_LIT
def unknownfunc():
globals()['lexeme'] = ''
lookup(nextChar)
getChar()
lexDict = {LETTER: letterfunc, DIGIT: digitfunc, UNKNOWN: unknownfunc}
if charClass in lexDict.keys():
lexDict[charClass]()
print('The next token is: '+ str(globals()['nextToken']) + ' The next lexeme is: ' + globals()['lexeme'])
with open('input.txt') as input:
contents = input.read()
getChar()
lex()
while contents:
lex()
I'm using the string sum + 1 / 33
as my sample input string. From what I understand, the first call to getChar() at the top level sets the characterClass to LETTER andcontents
to um + 1 / 33
.
The program then enters the while loop and calls lex()
. lex()
in turn accumulates the word sum in to lexeme
. When the while loop inside letterfunc
encounters the first white-space character, it breaks, exiting lex()
Since contents
is not empty, the program goes through the while loop at the top level again. This time, the getNonBlank()
call inside lex()
"throws out the spaces in contents
and the same process as before is repeated.
Where I encounter an error, is at the last lexeme. I'm told that globals()['contents'][0]
is out of range when called by getChar()
. I'm not expecting it to be a difficult error to find but I've tried tracing it by hand and can't seem to spot the problem. Any help would be greatly appreciated.
It is simply because after successfully reading the last 3
of input string, the digitfunc
function iterate one more time getchar
. But at that moment content
has been exhausted and is empty, so contents[0]
is passed end of buffer, hence the error.
As a workaround, if you add a newline or a space after the last character of expression, your current code does not exhibit the problem.
The reason for that is that when last char is UNKNOWN
you immediately return from lex and exit the loop because content
is empty, but if your are processing a number or a symbol you loop calling getchar
without testing end of input. By the way, if your input string ends with a right paren, your lexer eats it and forget to display that it found it.
So you should at least:
test end of input in getchar:
def getchar():
if len(contents) == 0:
# print "END OF INPUT DETECTED"
globals()['charClass'] = UNKNOWN
globals()['nextChar'] = ''
return
...
display the last token if one is left:
...
while contents:
lex()
lex()
control if a lexeme is present (weird things may happen at end of input)
...
if charClass in lexDict.keys():
lexDict[charClass]()
if lexeme != '':
print('The next token is: '+ str(globals()['nextToken']) +
' The next lexeme is: >' + globals()['lexeme'] + '<')
But your usage of globals is bad. The common idiom to use a global from within a function is to declare it before usage:
a = 5
def setA(val):
global a
a = val # sets the global variable a
But globals in Python are code smell. The best you could do is to properly encapsulate you parser in a class. Objects are better than globals