
Dropping text up to a special character with Parsec


I'm new to Haskell and Parsec, so my apologies if this question is trivial.

I want to parse lines of text that are structured like this:

<Text to be dropped> <special character (say "#")> <field 1> <comma> <field 2>
<comma> <field 3> <special character 2 (say "%")> <Text to be dropped>

I want my parser to discard the "text to be dropped" at the beginning and at the end, and to keep the contents of the fields. My main problem is understanding how to write a parser that drops everything up to a certain special character.

The parsers from the library that seem helpful are anyChar, manyTill and oneOf, but I don't understand how to combine them. I would be grateful for any simple example.


Solution

  • When writing Parsec code, it helps to first write out the grammar you want to parse in BNF, because parsers written in Parsec end up looking very much like the grammar.

    Let's try that:

    line ::= garbage '#' field ',' field ',' field '%' garbage
    

    The production above assumes a production named garbage, whose actual definition depends on what text you want dropped. Likewise, it assumes a production named field. Now let's write this production out as Parsec code:

    line = do
      garbage
      char '#'
      field1 <- field
      char ','
      field2 <- field
      char ','
      field3 <- field
      char '%'
      garbage
      return (field1, field2, field3)
    

    This code reads exactly like the BNF. The essential difference is that the results of some of the subproductions are named, so that we can return a value built from these results (in this case a tuple).

    I don't know what your notion of garbage is, but for the sake of example let's assume that you mean any whitespace. Then you could define garbage as follows:

    garbage = many space
    

    (Alternatively, Parsec already provides a combinator called spaces that skips zero or more whitespace characters.) If the garbage could be anything except the # delimiter character, then you could write:

    garbage = many (noneOf "#")
    

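    This parser consumes all input up to, but not including, the first '#'. Either way, because no name is bound to the result of garbage, its value is discarded. Note that the same garbage parser also runs after '%', so with this definition the trailing text is consumed only up to the next '#', if there is one.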
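
The pieces above can be assembled into a self-contained sketch. The question does not say what a field looks like, so the definition of field here (one or more characters other than ',' and '%') is an assumption, as is treating everything after '%' as trailing garbage. The names fields, lineParser, lineTill, parseLine and parseLineTill are made up for illustration. The second variant uses manyTill anyChar (char '#'), the combination the question asks about; unlike many (noneOf "#"), it also consumes the '#' itself.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Assumption: a field is a non-empty run of characters that are
-- neither the field separator ',' nor the closing delimiter '%'.
field :: Parser String
field = many1 (noneOf ",%")

-- The part between the delimiters: field ',' field ',' field '%'
fields :: Parser (String, String, String)
fields = do
  f1 <- field
  _  <- char ','
  f2 <- field
  _  <- char ','
  f3 <- field
  _  <- char '%'
  return (f1, f2, f3)

-- Variant 1: skip everything that is not '#', then consume the '#'.
-- The trailing text is dropped by skipping any remaining characters.
lineParser :: Parser (String, String, String)
lineParser = skipMany (noneOf "#") *> char '#' *> fields <* skipMany anyChar

-- Variant 2: manyTill anyChar (char '#') consumes characters up to
-- and including the first '#', and its result is thrown away.
lineTill :: Parser (String, String, String)
lineTill = manyTill anyChar (char '#') *> fields <* skipMany anyChar

-- Convenience wrappers that run the parsers on a String.
parseLine :: String -> Either ParseError (String, String, String)
parseLine = parse lineParser "<input>"

parseLineTill :: String -> Either ParseError (String, String, String)
parseLineTill = parse lineTill "<input>"
```

With these definitions, parseLine "noise#alpha,beta,gamma%trailing" should yield Right ("alpha","beta","gamma"), and input without a '#' yields a Left with a parse error. Whitespace around the fields is kept as part of each field; if that is unwanted, field could be adapted to skip it.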