
Megaparsec: skip space and non-alphanumeric


I'm a beginner with Megaparsec and Haskell in general, and I'm trying to write a parser for the following grammar:

A word will always be one of:

  1. A number composed of one or more ASCII digits (e.g. "0" or "1234") OR
  2. A simple word composed of one or more ASCII letters (e.g. "a" or "they") OR
  3. A contraction of two simple words joined by a single apostrophe (e.g. "it's" or "they're")

So far, I've got the following (this can probably be simplified):

data Word = Number String | SimpleWord String | Contraction String deriving (Show)

word :: Parser MyParser.Word
word = M.choice
  [ Number <$> number
  , Contraction <$> contraction
  , SimpleWord <$> simpleWord
  ]

number :: Parser String
number = M.some C.numberChar

simpleWord :: Parser String
simpleWord = M.some C.letterChar

contraction :: Parser String
contraction = do
  left <- simpleWord
  void $ C.char '\''
  right <- simpleWord
  return (left ++ "'" ++ right)

But I'm having a problem defining a parser that skips whitespace and anything else that is non-alphanumeric. For example, given the input 'abc', the parser should discard the apostrophes and take just the simple word. The following doesn't compile:

filler :: Parser Char
filler = M.some (C.spaceChar  A.<|> not C.alphaNumChar)

spaceConsumer :: Parser ()
spaceConsumer = L.space filler A.empty A.empty

lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer

Solution

  • Here is the complete working code that I came up with.

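    -- The qualified names below (WordCount.Word, WordCount.word) imply this module name.
    module WordCount where

    import qualified Control.Applicative as A
    import Control.Monad (void)
    import qualified Data.Char as C
    import Data.Void (Void)
    import qualified Text.Megaparsec as M
    import qualified Text.Megaparsec.Char as MC
    import qualified Text.Megaparsec.Char.Lexer as L

    type Parser =
      M.Parsec
        -- The type for custom error messages. We have none, so use `Void`.
        Void
        -- The input stream type. Let's use `String` for now.
        String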
    data Word = Number String | SimpleWord String | Contraction String deriving (Eq)
    instance Show WordCount.Word where
      show (Number x) = x
      show (SimpleWord x) = x
      show (Contraction x) = x
    -- Parse the whole input into words, or return a rendered error message.
    words :: String -> Either String [String]
    -- `<* M.eof` forces the parser to consume the entire input;
    -- `<*` sequences two actions and discards the value of the second.
    words input = case M.parse (M.some WordCount.word A.<* M.eof) "" input of
      -- err :: M.ParseErrorBundle String Void
      Left err -> Left (M.errorBundlePretty err)
      Right ws -> Right (map show ws)
    word :: Parser WordCount.Word
    word =
      M.skipManyTill filler $
        lexeme $
          M.choice
            -- <$> is infix for 'fmap'
            [ Number <$> number,
              Contraction <$> M.try contraction,
              SimpleWord <$> simpleWord
            ]
    number :: Parser String
    number = M.some MC.numberChar
    simpleWord :: Parser String
    simpleWord = M.some MC.letterChar
    contraction :: Parser String
    contraction = do
      left <- simpleWord
      void $ MC.char '\''
      right <- simpleWord
      return $ left ++ "'" ++ right
    -- Separator characters: anything that is not alphanumeric.
    -- Whitespace needs no separate check because it is never alphanumeric.
    isSep :: Char -> Bool
    isSep = not . C.isAlphaNum
    -- Fillers fill the space between tokens
    filler :: Parser ()
    filler = void $ M.some $ M.satisfy isSep
    -- The 2nd and 3rd arguments are the line-comment and block-comment parsers;
    -- `A.empty` means this grammar has no comments to skip.
    spaceConsumer :: Parser ()
    spaceConsumer = L.space filler A.empty A.empty
    -- A parser that discards trailing space
    lexeme :: Parser a -> Parser a
    lexeme = L.lexeme spaceConsumer
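
For reference, the `filler` in the question fails to type-check for two reasons: `not` has type `Bool -> Bool`, so it cannot be applied to a parser such as `C.alphaNumChar`, and `M.some` produces a list rather than a single `Char`. Negating the character predicate and passing it to `satisfy` avoids both problems. The following is a condensed sketch of the same idea, not a replacement for the code above. It assumes the `megaparsec` package is installed, the names `wordP` and `parseWords` are hypothetical, and unlike the answer it skips leading filler once up front and uses `many`, so empty input yields an empty list instead of an error.

```haskell
import Control.Applicative (empty, many, some, (<|>))
import Control.Monad (void)
import Data.Char (isAlphaNum)
import Data.Void (Void)
import Text.Megaparsec (Parsec, eof, errorBundlePretty, parse, satisfy, try)
import Text.Megaparsec.Char (char, letterChar, numberChar)
import qualified Text.Megaparsec.Char.Lexer as L

type Parser = Parsec Void String

-- One or more separator characters. The predicate is negated, not the parser.
filler :: Parser ()
filler = void (some (satisfy (not . isAlphaNum)))

-- The grammar has no comments, so both comment parsers are `empty`.
spaceConsumer :: Parser ()
spaceConsumer = L.space filler empty empty

-- Run a parser, then discard any filler that follows it.
lexeme :: Parser a -> Parser a
lexeme = L.lexeme spaceConsumer

contraction :: Parser String
contraction = do
  left <- some letterChar
  _ <- char '\''
  right <- some letterChar
  pure (left ++ "'" ++ right)

-- Hypothetical name: a number, a contraction, or a simple word.
-- `try` lets a failed contraction (e.g. "abc'") fall back to a simple word.
wordP :: Parser String
wordP = lexeme (some numberChar <|> try contraction <|> some letterChar)

-- Hypothetical name: skip leading filler, then read words until end of input.
parseWords :: String -> Either String [String]
parseWords input =
  case parse (spaceConsumer *> many wordP <* eof) "" input of
    Left err -> Left (errorBundlePretty err)
    Right ws -> Right ws
```

In GHCi, `parseWords "'abc'"` evaluates to `Right ["abc"]` and `parseWords "it's 42, they're!"` to `Right ["it's","42","they're"]`. Note that `numberChar` and `letterChar` also accept non-ASCII Unicode characters; if the grammar must be strictly ASCII, `digitChar` (ASCII digits only) and a `satisfy` with an ASCII letter predicate may be closer to the stated rules.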