parsing, haskell, parsec

Make a parser ignore all redundant whitespace


Say I have a parser p in Parsec, and I want it to ignore all superfluous whitespace. For example, suppose I define a list as starting with "[", ending with "]", and containing integers separated by whitespace. I don't want any errors if there is whitespace before the "[", after the "]", between the "[" and the first integer, and so on.

In my case, I want this to work for my parser for a toy programming language.

I will update with code if that is requested/necessary.


Solution

  • Use combinators to say what you mean:

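    -- Explicit import list: avoids name clashes with Parsec's own many and (<|>).
    import Control.Applicative ((<*), (*>))
    import Text.Parsec
    import Text.Parsec.String (Parser)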
    
    program :: Parser [[Int]]
    program = spaces *> many1 term <* eof
    
    term :: Parser [Int]
    term = list
    
    list :: Parser [Int]
    list = between listBegin listEnd (number `sepBy` listSeparator)
    
    listBegin, listEnd, listSeparator :: Parser Char
    listBegin = lexeme (char '[')
    listEnd = lexeme (char ']')
    listSeparator = lexeme (char ',')
    
    lexeme :: Parser a -> Parser a
    lexeme parser = parser <* spaces
    
    number :: Parser Int
    number = lexeme $ do
      digits <- many1 digit
      return (read digits :: Int)
    

    Try it out:

    λ :l Parse.hs
    Ok, modules loaded: Main.
    λ parseTest program " [1, 2, 3] [4, 5, 6] "
    [[1,2,3],[4,5,6]]
    

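    The lexeme combinator takes a parser and skips any whitespace that follows it. You then only need to wrap the primitive tokens of your language, such as listSeparator and number, in lexeme. Whitespace before the first token is not covered by any lexeme, which is why program begins with spaces.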
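
    The question asks for integers separated by whitespace alone, whereas the example above separates them with commas. A minimal sketch of that variant, assuming the same lexeme approach (wsList and wsProgram are hypothetical names introduced here for illustration):

    ```haskell
    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Run a parser, then skip any whitespace that follows it.
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces

    -- One or more digits read as an Int; trailing whitespace is consumed.
    number :: Parser Int
    number = lexeme (read <$> many1 digit)

    -- "[", zero or more integers, "]". No explicit separator is needed,
    -- because each number already consumes the whitespace after it.
    -- (wsList is a hypothetical name, not part of the answer above.)
    wsList :: Parser [Int]
    wsList = between (lexeme (char '[')) (lexeme (char ']')) (many number)

    -- Skip leading whitespace once, then require all input to be consumed.
    wsProgram :: Parser [[Int]]
    wsProgram = spaces *> many1 wsList <* eof
    ```

    With these definitions, parseTest wsProgram " [ 1 2  3 ]  [ ] " should print [[1,2,3],[]].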

    Alternatively, you can parse the stream of characters into a stream of tokens, then parse the stream of tokens into a parse tree. That way, both the lexer and the parser can be greatly simplified. It’s only worth doing for larger grammars, though, where maintainability is a concern, and it requires some of the lower-level Parsec API such as tokenPrim.
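
    The two-stage approach can be sketched roughly as follows. This is only an illustration under stated assumptions: the Tok type and the names lexer, satisfyTok, tokP, intP, listP, and parseLists are hypothetical, and each token is paired with its SourcePos so that tokenPrim can report positions.

    ```haskell
    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Hypothetical token type for the list language in the example above.
    data Tok = TOpen | TClose | TComma | TInt Int
      deriving (Eq, Show)

    type PosTok = (SourcePos, Tok)

    -- A single token, without any surrounding whitespace.
    rawTok :: Parser Tok
    rawTok =  TOpen  <$ char '['
          <|> TClose <$ char ']'
          <|> TComma <$ char ','
          <|> TInt . read <$> many1 digit

    -- Stage 1: characters to tokens. Whitespace is discarded here, once.
    lexer :: Parser [PosTok]
    lexer = spaces *> many (positioned rawTok <* spaces) <* eof
      where
        positioned p = (,) <$> getPosition <*> p

    -- Stage 2 parsers consume a list of positioned tokens.
    type TokParser = Parsec [PosTok] ()

    -- Accept one token for which the given function returns Just.
    satisfyTok :: (Tok -> Maybe a) -> TokParser a
    satisfyTok f = tokenPrim (show . snd) nextPos (f . snd)
      where
        nextPos _ _ ((pos, _) : _) = pos  -- position of the upcoming token
        nextPos pos _ []           = pos  -- no tokens left: keep the position

    -- Match exactly the given token.
    tokP :: Tok -> TokParser ()
    tokP t = satisfyTok (\t' -> if t == t' then Just () else Nothing)

    -- Match an integer token and return its value.
    intP :: TokParser Int
    intP = satisfyTok $ \t -> case t of
      TInt n -> Just n
      _      -> Nothing

    -- The grammar itself no longer mentions whitespace at all.
    listP :: TokParser [Int]
    listP = between (tokP TOpen) (tokP TClose) (intP `sepBy` tokP TComma)

    -- Run both stages; an error from either stage comes back as Left.
    parseLists :: String -> Either ParseError [[Int]]
    parseLists src = parse lexer "" src >>= parse (many1 listP <* eof) ""
    ```

    Under these assumptions, parseLists " [1, 2, 3] [4, 5, 6] " should return Right [[1,2,3],[4,5,6]]. The trade-off is the extra boilerplate in satisfyTok, which is why this tends to pay off only for larger grammars.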