Search code examples
haskellparsec

Take strings as tokens parsing with Parsec


I am using Parsec library to parse a string. The problem I have is that can't differentiate some tokens because they are words with the same prefix. Simplifying the whole grammar (in my case it's not a regular one), say we have the following:

T0 := <empty> | tag0 T1 T0
T1 := tag1 | tag1 T1

So I may have strings like "tag0tag1tag1" or like "tag0tag1tag0tag1" etc., basically we have a "tag0" string followed by an arbitrary (not zero) number of "tag1" string, and all this could also be repeated any number of times.

So what I tried was something like:

wrongParser :: Parser String
wrongParser = do
    string "tag0"
    many $ string "tag1"
    return "Ok"

And tested with

ghci> parse wrongParser "Error" "tag0tag1tag1tag0tag1"
Left "Error" (line 1, column 13):
unexpected "0"
expecting "tag1"

So what seems to happen here is that the parser read "tag" from "tag0", but it is expecting "tag1" instead (because is still reading many of "tag1").

Is there a way to make the parser to take the tag string as a whole so that instead of failing it just assumes that all the many tag1 are already read and stop with no error (maybe another function than string)? Or what is the correct way to handle this case?


Solution

  • This is a common misconception with Parsec:

    • A parser consumes char by char, by default. (these are called tokens in Parsec)
    • A failure in Parsec is a parser which comsumes less output than expected
    • Parsers do not backtrack unless you specify so.

    Therefore, your parser works like this

    string to consume: "tag0tag1tag1tag0tag1"
    string consumed: ""
    
    -- *********************
    -- Parser: string "tag0"
    -- *********************
    
    -- 1st token
    string to consume: "ag0tag1tag1tag0tag1"
    string consumed: "t"
    
    -- 2nd token
    string to consume: "g0tag1tag1tag0tag1"
    string consumed: "ta"
    
    -- 3th token
    string to consume: "0tag1tag1tag0tag1"
    string consumed: "tag"
    
    -- 4th token
    string to consume: "tag1tag1tag0tag1"
    string consumed: "tag0"
    
    -- ***********************
    -- done with string "tag0"
    -- ***********************
    
    -- ****************************
    -- Parser: many $ string "tag1" 
    -- ****************************
    
    -- convince yourself you'll reach this point
    string to consume: "tag0tag1"
    string consumed: "tag0tag1tag1"
    
    -- keep in mind we are executing (string "tag1")
    -- because you put it inside many.
    -- So you end up with 
    string to consume: "0tag1"
    string consumed: "tag0tag1tag1tag"
    
    -- after this, you consume a 0 and fail.
    string to consume: "tag1"
    string consumed: "tag0tag1tag1tag" 0 ??? 
    
    

    Probably you want to say, "if you fail reading tag1, just pretend you haven't read it at all". This is called backtracking and you use the try combinator to make it happen

    rightParser :: Parser String
    rightParser = do
      string "tag0"
      many1 $ try (string "tag1") -- many1 because tag0 must be followed by at least one tag1. try because you want to backtrack tag1 if failure.
      return "Ok"
    

    Now if you want to parse the whole string you use many rightParser. For the sake of example, let's say you want to return all "tag1"s. Then

    rightParser :: Parser [String]
    rightParser = do
      string "tag0"
      many1 $ try (string "tag1") -- same as before, by returns a list of tag1 instead of "Ok"
    
    wholeParser = many rightParser
    
    main = print $ parse wholeParser "Error" "tag0tag1tag1tag0tag1"
    
    > Right [["tag1","tag1"],["tag1"]]
    

    Notice, that other libraries (ex: Attoparsec) do always backtrack on failure. This is a design decision made by each library. Also, backtracking can be an expensive operation, so you may want to write your parser differently (example: always parse "tag" and backtrack only on "0" or "1")