Search code examples
pythonparsinganalyzerlexical

Writing Sebesda's lexical analyzer in python. Does not work for last lexeme in the input file


I have to translate lexical analyzer the code in Sebesda's Concpets of Programming Languages (chapter 4, section 2) to python. Here's what I have so far:

# Character classes #
LETTER = 0
DIGIT = 1
UNKNOWN = 99

# Token Codes #
INT_LIT = 10
IDENT = 11
ASSIGN_OP = 20
ADD_OP= 21
SUB_OP = 22
MULT_OP = 23
DIV_OP = 24
LEFT_PAREN = 25
RIGHT_PAREN = 26

charClass = ''
lexeme = ''
lexLen = 0
token = ''
nextToken = ''

### lookup - function to lookup operators and parentheses ###
###          and return the token                         ###
def lookup(ch):
    def left_paren():
        addChar()
        globals()['nextToken'] = LEFT_PAREN

    def right_paren():
        addChar()
        globals()['nextToken'] = RIGHT_PAREN

    def add():
        addChar()
        globals()['nextToken'] = ADD_OP

    def subtract():
        addChar()
        globals()['nextToken'] = SUB_OP

    def multiply():
        addChar()
        globals()['nextToken'] = MULT_OP

    def divide():
        addChar()
        globals()['nextToken'] = DIV_OP
    options = {')': right_paren, '(': left_paren, '+': add,
               '-': subtract, '*': multiply , '/': divide}

    if ch in options.keys():
        options[ch]()
    else:
        addChar()

### addchar- a function to add next char to lexeme ###
def addChar():
    #lexeme = globals()['lexeme']
    if(len(globals()['lexeme']) <=98):
        globals()['lexeme'] += nextChar
    else:
        print("Error. Lexeme is too long")

### getChar- a function to get the next Character of input and determine its character class ###
def getChar():
    globals()['nextChar'] = globals()['contents'][0]
    if nextChar.isalpha():
        globals()['charClass'] = LETTER
    elif nextChar.isdigit():
        globals()['charClass'] = DIGIT
    else:
        globals()['charClass'] = UNKNOWN
    globals()['contents'] = globals()['contents'][1:]


## getNonBlank() - function to call getChar() until it returns a non whitespace character ##
def getNonBlank():
    while nextChar.isspace():
        getChar()

## lex- simple lexical analyzer for arithmetic functions ##
def lex():
    globals()['lexLen'] = 0
    getNonBlank()
    def letterfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == LETTER or globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = IDENT

    def digitfunc():
        globals()['lexeme'] = ''
        addChar()
        getChar()
        while(globals()['charClass'] == DIGIT):
            addChar()
            getChar()
        globals()['nextToken'] = INT_LIT

    def unknownfunc():
        globals()['lexeme'] = ''
        lookup(nextChar)
        getChar()

    lexDict = {LETTER: letterfunc, DIGIT: digitfunc, UNKNOWN: unknownfunc}
    if charClass in lexDict.keys():
        lexDict[charClass]()
    print('The next token is: '+ str(globals()['nextToken']) + ' The next lexeme is: ' + globals()['lexeme'])

with open('input.txt') as input:
    contents = input.read()
    getChar()
    lex()
    while contents:
        lex()

I'm using the string sum + 1 / 33 as my sample input string. From what I understand, the first call to getChar() at the top level sets the characterClass to LETTER andcontents to um + 1 / 33.

The program then enters the while loop and calls lex(). lex() in turn accumulates the word sum in to lexeme. When the while loop inside letterfunc encounters the first white-space character, it breaks, exiting lex()

Since contents is not empty, the program goes through the while loop at the top level again. This time, the getNonBlank() call inside lex() "throws out the spaces in contents and the same process as before is repeated.

Where I encounter an error, is at the last lexeme. I'm told that globals()['contents'][0] is out of range when called by getChar(). I'm not expecting it to be a difficult error to find but I've tried tracing it by hand and can't seem to spot the problem. Any help would be greatly appreciated.


Solution

  • It is simply because after successfully reading the last 3 of input string, the digitfunc function iterate one more time getchar. But at that moment content has been exhausted and is empty, so contents[0] is passed end of buffer, hence the error.

    As a workaround, if you add a newline or a space after the last character of expression, your current code does not exhibit the problem.

    The reason for that is that when last char is UNKNOWN you immediately return from lex and exit the loop because content is empty, but if your are processing a number or a symbol you loop calling getchar without testing end of input. By the way, if your input string ends with a right paren, your lexer eats it and forget to display that it found it.

    So you should at least:

    • test end of input in getchar:

      def getchar():
          if len(contents) == 0:
              # print "END OF INPUT DETECTED"
              globals()['charClass'] = UNKNOWN
              globals()['nextChar'] = ''
              return
          ...
      
    • display the last token if one is left:

      ...
      while contents:
          lex()
      lex()
      
    • control if a lexeme is present (weird things may happen at end of input)

      ...
      if charClass in lexDict.keys():
          lexDict[charClass]()
      if lexeme != '':
          print('The next token is: '+ str(globals()['nextToken']) +
                ' The next lexeme is: >' + globals()['lexeme'] + '<')
      

    But your usage of globals is bad. The common idiom to use a global from within a function is to declare it before usage:

    a = 5
    
    def setA(val):
        global a
        a = val   # sets the global variable a
    

    But globals in Python are code smell. The best you could do is to properly encapsulate you parser in a class. Objects are better than globals