I'm trying to create a parser (with parsec) that parses tokens delimited by newlines, commas, semicolons and Unicode dashes (ndash and mdash):
authorParser = do
    name <- many1 (noneOf [',', ':', '\r', '\n', '\8212', '\8213'])
    many (char ',' <|> char ':' <|> char '-' <|> char '\8212' <|> char '\8213')
But the ndash-mdash (\8212, \8213) part never 'succeeds' and I'm getting invalid parse results.
How do I specify Unicode dashes with the char parser?
P.S. I've tried (chr 8212) and (chr 8213) too. It doesn't help.
ADDITION: It is better to use Data.Text. Switching from the ByteString madness to Data.Text saved me a lot of time and 'source space' :)
Works for me:
Prelude Text.ParserCombinators.Parsec> let authorName = do { name <- many1 (noneOf ",:\r\n\8212\8213"); many (oneOf ",:-\8212\8213"); }
Prelude Text.ParserCombinators.Parsec> parse authorName "" "my Name,\8212::-:\8213,"
Right ",\8212::-:\8213,"
How did you try?
The above was using plain String, which works without problems because a Char is a full Unicode code point. It's not as nice with other types of input stream. Text will probably also work well for this example; I think the dashes are encoded as a single code unit there. For ByteString, however, things are more complicated. If you're using plain Data.ByteString.Char8 (strict or lazy, doesn't matter), the Chars get truncated on packing: only the least significant 8 bits are retained, so '\8212' becomes 20 and '\8213' becomes 21. If the input stream is constructed the same way, that still kind of works, except that every Char congruent to 20 or 21 modulo 256 is mapped to the same byte as one of the dashes.
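The truncation is easy to observe directly. The following is a minimal sketch, assuming the bytestring package is available; the names packedDashes, roundTripped and collides are made up for this illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Bytes actually stored when the two dash characters are packed via Char8.
-- Only the low 8 bits of each code point survive.
packedDashes :: [Word8]
packedDashes = B.unpack (BC.pack "\8212\8213")

-- Unpacking through Char8 yields the truncated characters, not the dashes.
roundTripped :: String
roundTripped = BC.unpack (BC.pack "\8212\8213")

-- An unrelated character, code point 276 (= 256 + 20), packs to the same
-- byte as '\8212', so the two can no longer be told apart.
collides :: Bool
collides = BC.pack "\276" == BC.pack "\8212"
```

Loaded into GHCi, packedDashes should evaluate to [20,21], roundTripped to "\DC4\NAK" (the characters '\20' and '\21'), and collides to True.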
However, it is likely that the input stream is UTF-8 encoded. In that case the dashes are encoded as three bytes each, "\226\128\148" and "\226\128\149" respectively, which doesn't match what you get by truncating. Parsing UTF-8 encoded text with ByteString and parsec is a bit more involved, because the units of which the parse result is composed are not single bytes but sequences of bytes, each 1-4 bytes long.
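To check the byte sequences yourself, you can encode a single character and look at the resulting bytes. This is a small sketch, assuming the text and bytestring packages; utf8Bytes is a helper name invented here.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- UTF-8 encode one character and return the individual bytes.
utf8Bytes :: Char -> [Word8]
utf8Bytes = B.unpack . encodeUtf8 . T.singleton
```

In GHCi, utf8Bytes '\8212' should give [226,128,148] and utf8Bytes '\8213' should give [226,128,149], while an ASCII character such as 'a' gives the single byte [97].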
To use noneOf, you need an
instance Text.Parsec.Prim.Stream ByteString m Char
which does the right thing. The instance provided in Text.Parsec.ByteString[.Lazy] doesn't: it uses the Data.ByteString[.Lazy].Char8 interface, so depending on how the input was encoded, the dash '\8212' would either become a single '\20', which doesn't match '\8212', or produce three Chars, '\226', '\128' and '\148', in three successive calls to uncons, none of which matches '\8212' either.
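One way to sidestep the missing instance, in line with the ADDITION in the question, is to decode the bytes to Text first and run parsec over that. The following is only a sketch under a few assumptions: the input really is UTF-8, a parsec version that ships Text.Parsec.Text is installed, and the parser returns the name rather than the delimiters (a change made here for illustration). The names parseAuthor and sampleInput are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import Text.Parsec (ParseError, many, many1, noneOf, oneOf, parse)
import Text.Parsec.Text (Parser)

-- Same character sets as in the GHCi session above, but over a Text stream,
-- where each dash arrives as a single Char.
authorName :: Parser String
authorName = do
    name <- many1 (noneOf ",:\r\n\8212\8213")
    _ <- many (oneOf ",:-\8212\8213")   -- skip the trailing delimiters
    return name

-- Decode first, then parse. Note that decodeUtf8 throws an exception on
-- invalid UTF-8; decodeUtf8' returns an Either instead.
parseAuthor :: B.ByteString -> Either ParseError String
parseAuthor = parse authorName "" . decodeUtf8

-- Stand-in for UTF-8 encoded input read from a file or socket.
sampleInput :: B.ByteString
sampleInput = encodeUtf8 (T.pack "my Name,\8212::-:\8213,")
```

With these definitions, parseAuthor sampleInput should yield Right "my Name", because the decoded stream presents each dash as one Char that noneOf and oneOf can match.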