
Haskell Parsec: skip all words that aren't predefined


I'm learning to use the Parsec library as part of homework.

EDIT: Suggestions to use other libraries are welcome; the point is the parsing.

What I want is to extract all words that start with a capital letter, plus the four compass directions, from any sentence. Example: "Belgium totally lies south of Holland." should find and return "Belgium south Holland".

What I can't figure out is how to ignore (eat) any input that is -not- a compass direction. I was looking for something along the lines of

'many (not compassDirection >> space)'

but g(h)oogle isn't helping me.

The following code obviously gets stuck on the 'many' function.

readExpr :: String -> String
readExpr input = case parse (parseLine) "" input of
    Left err -> "No match: " ++ show err
    Right val -> "Found: " ++ showVal val

parseLine :: Parser GraphValue
parseLine = do
            x <- parseCountry
            space
            many ( some (noneOf " ") >> space )
            y <- parseCompass
            space
            many ( some (noneOf " ") >> space )
            z <- parseCountry
            return $ Direction [x,y,z]

compassDirection :: Parser String
compassDirection = string "north" <|>
                   string "south" <|>
                   string "east" <|>
                   string "west"

parseCountry :: Parser GraphValue
parseCountry = do 
                c <- upper 
                x <- many (lower)
                return $ Country (c:x)

parseCompass :: Parser GraphValue
parseCompass = do 
                x <- compassDirection
                return $ Compass x

Solution

  • I won't go into specifics since this is homework and the OP said the "important thing is the parsing".


    The way I'd solve this problem:

    • tokenize the input. Break it into words; this will free the real parsing step from having to worry about token definitions (e.g. "is %#@[ part of a word?") or whitespace. This could be as simple as the standard words function, or you could use Parsec for the tokenization. Then you'll have [Token] (or [String] if you prefer).

    • a parser for compass directions. You already have this (good job), but it'll have to be modified a bit if the input is [String] instead of String.

    • a parser for words that start with a capital letter.

    • a parser for everything else, that succeeds whenever it sees a token that isn't a compass direction or a word starting with a capital letter.

    • a parser that works on any token, but distinguishes between good stuff and bad stuff, perhaps using an algebraic data type.

    • a parser that works on lots of tokens

    Hopefully that's clear without being too clear; you'll still have to worry about when to discard the junk, for example. The basic idea is to break the problem down into lots of little sub-problems, solve the sub-problems, then glue those solutions together.
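
    For anyone who wants to see the rough shape of that decomposition, here is a minimal sketch under some assumptions. The names Item, TokParser, tokenize, satisfyWord and extract are made up for illustration and are not from the question; tokenization uses words plus trimming of non-letters at word edges, which is only one possible token definition; and the "everything else" parser here simply accepts any word and relies on being tried last. It is one way to glue the pieces together, not the only one.

```haskell
import Data.Char (isAlpha, isUpper)
import Data.List (dropWhileEnd)
import Text.Parsec (ParseError, Parsec, eof, many, parse, tokenPrim, (<|>))
import Text.Parsec.Pos (incSourceColumn)

-- Hypothetical type separating the "good stuff" from the "bad stuff".
data Item = Country String | Compass String | Junk String
  deriving (Show, Eq)

-- A parser whose input stream is a list of words, not characters.
type TokParser = Parsec [String] ()

-- Tokenize: split on whitespace and trim non-letters from both ends,
-- so "Holland." becomes "Holland". (Assumed token definition.)
tokenize :: String -> [String]
tokenize = filter (not . null) . map trim . words
  where
    trim = dropWhileEnd (not . isAlpha) . dropWhile (not . isAlpha)

-- Accept one word if the classifier returns Just; fail without consuming otherwise.
satisfyWord :: (String -> Maybe a) -> TokParser a
satisfyWord = tokenPrim show nextPos
  where
    nextPos pos _ _ = incSourceColumn pos 1

-- One of the four compass directions.
compassWord :: TokParser Item
compassWord = satisfyWord $ \w ->
  if w `elem` ["north", "south", "east", "west"] then Just (Compass w) else Nothing

-- A word that starts with a capital letter.
capitalWord :: TokParser Item
capitalWord = satisfyWord $ \w -> case w of
  (c : _) | isUpper c -> Just (Country w)
  _                   -> Nothing

-- Everything else: accepts any word, so it must be tried last.
junkWord :: TokParser Item
junkWord = satisfyWord (Just . Junk)

-- Any single token, classified.
item :: TokParser Item
item = compassWord <|> capitalWord <|> junkWord

-- Lots of tokens, up to the end of the input.
items :: TokParser [Item]
items = many item <* eof

-- Tokenize, parse, then discard the junk.
extract :: String -> Either ParseError [String]
extract input = keep <$> parse items "" (tokenize input)
  where
    keep xs = [w | x <- xs, Just w <- [wanted x]]
    wanted (Country w) = Just w
    wanted (Compass w) = Just w
    wanted (Junk _)    = Nothing
```

    With these definitions, extract "Belgium totally lies south of Holland." should give Right ["Belgium","south","Holland"]. Because the token-level parsers fail without consuming input, the alternatives in item can be chained with <|> and no try is needed.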