Tags: parsing, haskell, comments, parsec, code-translation

Preserving comments in `Text.Parsec.Token` tokenizers


I'm writing a source-to-source transformation using parsec, so I have a LanguageDef for my language and build a TokenParser for it using Text.Parsec.Token.makeTokenParser:

myLanguage = LanguageDef
  { ...
  , commentStart = "/*"
  , commentEnd   = "*/"
  , ...
  }

-- defines 'stringLiteral', 'identifier', etc...
TokenParser {..} = makeTokenParser myLanguage

Unfortunately, since I defined commentStart and commentEnd, each of the parser combinators in the TokenParser is a lexeme parser implemented in terms of whiteSpace, and whiteSpace eats comments as well as spaces.
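
For example, the comment text never surfaces in any parse result. Here's a self-contained sketch of the behaviour; the emptyDef-based lexer below is just a stand-in for my real LanguageDef:

import Text.Parsec
import Text.Parsec.Language (emptyDef)
import qualified Text.Parsec.Token as Tok

-- Throwaway lexer with the same comment delimiters as myLanguage.
lexer :: Tok.TokenParser ()
lexer = Tok.makeTokenParser emptyDef
  { Tok.commentStart = "/*"
  , Tok.commentEnd   = "*/"
  }

main :: IO ()
main = print (parse (many (Tok.identifier lexer)) "" "foo /* lost */ bar")
-- Right ["foo","bar"]   -- the comment is swallowed by whiteSpace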

What is the right way to preserve comments in this situation?

Approaches I can think of:

  1. Don't define commentStart and commentEnd. Wrap each of the lexeme parsers in another combinator that grabs comments before parsing each token (see the sketch after this list).
  2. Implement my own version of makeTokenParser (or perhaps use some library that generalizes Text.Parsec.Token; if so, which library?)
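
A rough sketch of option 1 (the names withComments and blockComment are mine, not part of Text.Parsec.Token; the comment parser itself would be supplied separately):

import Text.Parsec

-- Wrap a lexeme parser so that any comments appearing before the token
-- are collected and paired with the token's result.
withComments :: Parsec String u String        -- parser for one comment's text
             -> Parsec String u a             -- the underlying token parser
             -> Parsec String u ([String], a)
withComments blockComment p = do
  cs <- many (try (spaces *> blockComment))   -- leading comments, if any
  spaces
  x  <- p
  return (cs, x)

Each wrapped parser then yields ([comments], token), which the rest of the grammar has to thread through into the output.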

What's the done thing in this situation?


Solution

  • In principle, defining commentStart and commentEnd doesn't fit with preserving comments, because you need to treat comments as valid parts of both the source and the target language, including them in your grammar and in your AST/ADT.

    In this way, you'd be able to keep the text of the comment as the payload data of a Comment constructor, and output it appropriately in the target language, something like

    data Statement = Comment String | Return Expression | ...
    

    The fact that neither the source nor the target language treats the comment text as meaningful doesn't matter to your translation code: to the translator, a comment is just more data to carry across.


    Major problem with this approach: it doesn't really fit with makeTokenParser, and suits implementing your source language's parser from the ground up instead.

    I guess I'm veering towards editing makeTokenParser so that the comment parsers return the comment text as a String instead of ().
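
    A minimal sketch of that direction, using plain parsec (the names commentText and Stmt are illustrative, not anything makeTokenParser provides):

    import Text.Parsec

    -- Keep comments in the AST, as sketched above.
    data Stmt = Comment String | Other String
      deriving Show

    -- A block-comment parser that returns the text instead of ();
    -- essentially the change to make inside makeTokenParser
    -- (nested comments ignored for brevity).
    commentText :: Parsec String u String
    commentText = string "/*" *> manyTill anyChar (try (string "*/"))

    -- Comments then become ordinary productions in the grammar.
    commentStmt :: Parsec String u Stmt
    commentStmt = Comment <$> commentText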