python regex compiler-construction lexical-analysis

How to check for a sequence of characters that does not match any regex


I am implementing a lexical scanner that will later become part of a compiler. The program uses regular expressions to match an input program file. If a series of non-whitespace characters matches a regex, the matched section of input is converted into a token which, together with the other tokens, is sent to a parser. The code already outputs the right tokens, but I need the scanner to raise an exception (by calling a method no_token()) if it finds a series of non-whitespace characters that does not match any of the given regular expressions. This is my first post here, so if you have any tips on how I can improve my posts, or if you need more information about the question or the code, please let me know.

def get_token(self):
    '''Returns the next token and the part of input_string it matched.
       The returned token is None if there is no next token.
       The characters up to the end of the token are consumed.
       Raise an exception by calling no_token() if the input contains
       extra non-white-space characters that do not match any token.'''
    self.skip_white_space()
    # find the longest prefix of input_string that matches a token
    token, longest = None, ''
    for (t, r) in Token.token_regexp:
        match = re.match(r, self.input_string[self.current_char_index:])
        if match is None:
            self.no_token()
        elif match and match.end() > len(longest):
            token, longest = t, match.group()
    self.current_char_index += len(longest)
    return (token, longest)

As you can see, I tried using

if match is None:
    self.no_token()

but this raises the exception and exits the program at the start of the input, so no tokens are returned. If I comment it out, the code works fine. I need this check to raise an exception when non-whitespace characters do not match any regex; otherwise it will cause problems at later stages of development.

The method skip_white_space() consumes all whitespace up to the next non-whitespace character, the regular expressions are stored in token_regexp, and self.input_string[self.current_char_index:] gives the remaining input, starting at the current character.
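As a side note on the slicing expression, here is a minimal sketch comparing re.match on a slice with the pos argument of a compiled pattern's match method. The sample string and position are made up for illustration. Both forms try the pattern only at the given position; they differ in how end() is reported.

```python
import re

source = "z := 2;"
position = 2  # index of ":" in source (chosen for illustration)

# Slicing, as in the question: the pattern is tried at the start of the slice.
by_slice = re.match(r":=", source[position:])

# Alternative without copying the string: a compiled pattern's match()
# accepts a start position directly.
by_pos = re.compile(r":=").match(source, position)

print(by_slice.group(), by_slice.end())  # end() is relative to the slice
print(by_pos.group(), by_pos.end())      # end() is an index into source
```

If the pos form were used in get_token(), the length comparison would have to use len(match.group()) instead of match.end().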

For the following program, supplied as a .txt file:

z := 2;
if z < 3 then
  z := 1
end

Without the call to no_token(), the output is:

ID z
BEC
NUM 2
SEM
IF
ID z
LESS
NUM 3
THEN
ID z
BEC
NUM 1
END

which is correct. But when I add the no_token() call, I get:

lexical error: no token found at the start of z := 2;
if z < 3 then
  z := 1
end

This is what no_token() outputs when a series of characters does not match any regex implemented in the scanner, but that is not the case for this input: all the character sequences here are valid.
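One plausible reading of this behaviour, assuming token_regexp holds several (name, pattern) pairs: at any position most patterns fail to match, so a check placed inside the loop fires as soon as one pattern fails, not when all of them fail. The three-entry table below is hypothetical and only serves to show the effect.

```python
import re

# Hypothetical token table, for illustration only.
token_regexp = [
    ("NUM", r"[0-9]+"),
    ("BEC", r":="),
    ("ID", r"[a-z]+"),
]


def match_results(text):
    """Return, for each pattern, whether it matches at the start of text."""
    return [(name, re.match(pattern, text) is not None)
            for name, pattern in token_regexp]


results = match_results("z := 2;")
# NUM and BEC fail here even though "z" is a valid ID, so an error check
# inside the loop would be triggered by the first failing pattern.
print(results)
```

Under this assumption, the check for "no token" only makes sense once the loop has finished and every pattern has been tried.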


Solution

  • Got it all sorted. Cheers

    def get_token(self):
        '''Returns the next token and the part of input_string it matched.
           The returned token is None if there is no next token.
           The characters up to the end of the token are consumed.
           Raise an exception by calling no_token() if the input contains
           extra non-white-space characters that do not match any token.'''
        self.skip_white_space()
        # find the longest prefix of input_string that matches a token
        token, longest = None, ''
        for (t, r) in Token.token_regexp:
            match = re.match(r, self.input_string[self.current_char_index:])
            if match and match.end() > len(longest):
                token, longest = t, match.group()
    
        self.current_char_index += len(longest)
        if token is None and self.current_char_index < len(self.input_string):
            self.no_token()
        return (token, longest)
    

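    This was the final working code: no_token() is now called only after every regex has been tried and none matched, instead of inside the loop.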
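For anyone who wants to run the fix end to end, here is a self-contained sketch built around the get_token() above. Everything apart from get_token() is an assumption made for illustration: the Scanner and LexicalError names, the token table and its patterns, the bodies of skip_white_space() and no_token(), and the tokenize() helper are not from the original code.

```python
import re


class LexicalError(Exception):
    """Hypothetical exception raised when no token matches."""


class Scanner:
    # Hypothetical token table. Keywords are listed before ID so that they
    # win when both match a prefix of the same length (the comparison
    # in get_token() is strict).
    token_regexp = [
        ("IF", r"if"),
        ("THEN", r"then"),
        ("END", r"end"),
        ("BEC", r":="),
        ("SEM", r";"),
        ("LESS", r"<"),
        ("NUM", r"[0-9]+"),
        ("ID", r"[a-z]+"),
    ]

    def __init__(self, input_string):
        self.input_string = input_string
        self.current_char_index = 0

    def skip_white_space(self):
        # Assumed behaviour: advance past whitespace to the next token.
        while (self.current_char_index < len(self.input_string)
               and self.input_string[self.current_char_index].isspace()):
            self.current_char_index += 1

    def no_token(self):
        # Assumed behaviour: report the unmatched remainder of the input.
        rest = self.input_string[self.current_char_index:]
        raise LexicalError(
            "lexical error: no token found at the start of " + rest)

    def get_token(self):
        self.skip_white_space()
        # Find the longest prefix of the remaining input that matches a token.
        token, longest = None, ""
        for name, pattern in self.token_regexp:
            match = re.match(
                pattern, self.input_string[self.current_char_index:])
            if match and match.end() > len(longest):
                token, longest = name, match.group()
        self.current_char_index += len(longest)
        # Only now, with every pattern tried, is "no match" meaningful.
        if token is None and self.current_char_index < len(self.input_string):
            self.no_token()
        return token, longest


def tokenize(text):
    """Collect (token, lexeme) pairs until the scanner reports no token."""
    scanner = Scanner(text)
    tokens = []
    while True:
        token, lexeme = scanner.get_token()
        if token is None:
            return tokens
        tokens.append((token, lexeme))
```

With this table, tokenize() on the sample program from the question yields the same sequence of tokens as the output listed there, while an input such as "z := @" raises LexicalError once the scanner reaches the "@".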