Tags: python, compiler-construction, lexical-analysis

How can I ignore comments when tokenizing a string (compiler design)?


I want to ignore every comment of the form { comment } and // comment. I have a pointer named peek that walks my string character by character. I know how to ignore newlines, tabs, and spaces, but I don't know how to ignore comments.

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''

for i, peek in enumerate(string.lower()):
    if peek == ' ' or peek == '\n':
        tokens.append(tmp)
        # ignoring WS's and comments
        if(len(tmp)>0): 
            print(tmp)

        tmp = ''
    
    else:
        tmp += peek

Here is my result:

begin
west
west
north//
comment1
north
north
west
east
east
south
{
comment2
}
end

As you can see, spaces are ignored but comments aren't.

How can I get a result like below?

begin
west
west
north
north
north
west
east
east
south
end

Solution

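  • Use a flag skip = False: set it to True when you reach { and back to False when you reach }, and run the rest of your if/else only while skip is False (elif not skip:). Note that this handles only { } comments; // comments still end up in the output.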

    string =  """  beGIn west   WEST north//comment1 \n
    north       north west East east south\n
    // comment west\n
    {\n
        comment\n
    }\n end
    """
    
    tokens = []
    tmp = ''
    skip = False
    
    for peek in string.lower():
    
        if peek == '{':
            skip = True       # entering a { } comment
        elif peek == '}':
            skip = False      # leaving the comment
        elif not skip:
    
            if peek == ' ' or peek == '\n':
                # ignoring whitespace; keep only non-empty words
                if len(tmp) > 0:
                    tokens.append(tmp)
                    print(tmp)
                tmp = ''
            else:
                tmp += peek
    

    You may also have nested comments such as { { } }, for example

    {\n
        { comment1 }\n
        comment2\n
        { comment3 }\n
    }\n
    

    so it is better to let skip count the currently open { braces:

    string =  """  beGIn west   WEST north//comment1 \n
    north       north west East east south\n
    // comment west\n
    {\n
        { comment1 }\n
        comment2\n
        { comment3 }\n
    }\n end
    """
    
    tokens = []
    tmp = ''
    skip = 0
    
    for peek in string.lower():
    
        if peek == '{':
            skip += 1         # one more open comment level
        elif peek == '}':
            skip -= 1         # close the innermost comment level
        elif not skip:  # elif skip == 0:
    
            if peek == ' ' or peek == '\n':
                # ignoring whitespace; keep only non-empty words
                if len(tmp) > 0:
                    tokens.append(tmp)
                    print(tmp)
                tmp = ''
            else:
                tmp += peek
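
The loops above skip { } comments but still print the // comments. A minimal sketch of one way to handle both in the same character-by-character scan is shown here. The function name tokenize and the variable depth are my own choices, not part of the original code, and the sketch assumes that // inside a { } comment has no special meaning and that a stray } outside a comment is simply ignored.

```python
def tokenize(source):
    """Split source into lowercase words, skipping // and nested { } comments."""
    text = source.lower()
    tokens = []
    tmp = ''
    depth = 0          # how many { } comment levels are currently open
    i = 0
    while i < len(text):
        peek = text[i]
        if depth == 0 and text.startswith('//', i):
            # Line comment: flush the pending word, then jump to the end of line.
            if tmp:
                tokens.append(tmp)
                tmp = ''
            end = text.find('\n', i)
            i = len(text) if end == -1 else end
            continue
        if peek == '{':
            # A comment also ends the word in progress (assumption).
            if depth == 0 and tmp:
                tokens.append(tmp)
                tmp = ''
            depth += 1
        elif peek == '}':
            depth = max(depth - 1, 0)   # ignore a stray closing brace
        elif depth == 0:
            if peek.isspace():
                if tmp:
                    tokens.append(tmp)
                    tmp = ''
            else:
                tmp += peek
        i += 1
    if tmp:
        tokens.append(tmp)              # flush the last word
    return tokens


sample = """  beGIn west   WEST north//comment1
north       north west East east south
// comment west
{
    { comment1 }
    comment2
    { comment3 }
}
 end
"""

for token in tokenize(sample):
    print(token)
```

An index-based while loop is used instead of a for loop so that the scan can look one character ahead for // and jump straight to the end of the line.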
    

    It might be cleaner to turn everything, comments included, into tokens first and filter out the comment tokens afterwards, but I did not pursue that idea here.
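
For illustration, the tokenize-then-filter idea could look roughly like the sketch below, using only the standard re module. The names TOKEN_RE, lex, and significant are hypothetical, and the BLOCK_COMMENT pattern deliberately assumes comments are not nested, because a single regular expression cannot match arbitrarily nested braces.

```python
import re

# One alternative per token kind; the first alternative that matches wins.
TOKEN_RE = re.compile(r"""
    (?P<LINE_COMMENT>//[^\n]*)           # // up to the end of the line
  | (?P<BLOCK_COMMENT>\{[^{}]*\})        # { ... } without nesting (assumption)
  | (?P<NAME>[A-Za-z_][A-Za-z0-9_]*)     # identifiers / keywords
  | (?P<WS>\s+)                          # spaces, tabs, newlines
  | (?P<ERROR>.)                         # any other single character
""", re.VERBOSE)


def lex(source):
    """Return every token, comments and whitespace included, as (kind, text)."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


def significant(tokens):
    """Drop the token kinds a parser would not care about."""
    skip = {'LINE_COMMENT', 'BLOCK_COMMENT', 'WS'}
    return [(kind, value) for kind, value in tokens if kind not in skip]


source = "beGIn west//c1\n{ c2 }\n end"
all_tokens = lex(source)
for kind, value in significant(all_tokens):
    print(kind, ':', value)
```

Keeping the comments as tokens in the first pass means they are still available for tools that need them, while the filter gives the parser a clean stream.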


    EDIT:

    A version using the Python module sly, which works similarly to the C/C++ tools lex/yacc.

    I found the regex for MULTI_LINE_COMMENT in another parser-building tool, lark (see its syntax for multiline comments):

    from sly import Lexer
    
    class MyLexer(Lexer):
        # Create it before defining regexes for the tokens
        tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }
    
        ignore = ' \t'
    
        # Tokens
        NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
        ONE_LINE_COMMENT = r'//.*'
        # Greedy: matches from the first { to the last } in the input
        MULTI_LINE_COMMENT = r'\{(.|\n)*\}'
    
        # Ignored pattern
        ignore_newline = r'\n+'
    
        # Extra action for newlines
        def ignore_newline(self, t):
            self.lineno += t.value.count('\n')
    
        # Work with errors
        def error(self, t):
            print("Illegal character '%s'" % t.value[0])
            self.index += 1
    
    if __name__ == '__main__':
        
        text =  """  beGIn west   WEST north//comment1 
    north       north west East east south
    // comment west
    {
        { comment1 }
        comment2
        { comment3 }
    }
     end
    """
        
        lexer = MyLexer()
        tokens = lexer.tokenize(text)
        for item in tokens:
            print(item.type, ':', item.value)
    

    Result:

    NAME : beGIn
    NAME : west
    NAME : WEST
    NAME : north
    ONE_LINE_COMMENT : //comment1 
    NAME : north
    NAME : north
    NAME : west
    NAME : East
    NAME : east
    NAME : south
    ONE_LINE_COMMENT : // comment west
    MULTI_LINE_COMMENT : {
        { comment1 }
        comment2
        { comment3 }
    }
    NAME : end
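
One caveat about the MULTI_LINE_COMMENT pattern: it is greedy, so it matches from the first { to the last } in the whole input. That is why the nested comment above comes out as one token, but it would also swallow any code between two separate comments. The sketch below shows the difference with the standard re module alone, assuming sly passes the pattern to re unchanged; the variable names are only for illustration.

```python
import re

text = "west { c1 } north { c2 } east"

greedy = re.compile(r'\{(.|\n)*\}')    # same shape as MULTI_LINE_COMMENT
lazy = re.compile(r'\{(.|\n)*?\}')     # stops at the first closing brace

# Greedy: one match that also swallows the word "north" between the comments.
greedy_match = greedy.search(text).group()
print(repr(greedy_match))

# Lazy: two separate comments, "north" is left alone.
lazy_matches = [m.group() for m in lazy.finditer(text)]
print(lazy_matches)

# The lazy form cannot handle nesting: it stops at the inner closing brace.
nested_match = lazy.search("{ { a } b }").group()
print(repr(nested_match))
```

Neither variant is right for every input: the greedy form suits a single, possibly nested comment, while the lazy form suits several flat comments. Properly nested comments need a counter, as in the loop earlier in this answer.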