Parsing escape characters when creating parser from scratch in Haskell


I have written the code below as part of building a parser from scratch. However, I get unexpected output when the input contains escape characters, similar to what is described here, although my output differs. This is what I see in ghci:

ghci> parseString "'\\\\'"
[(Const (StringVal "\\"),"")]
ghci> parseString "'\\'"
[]
ghci> parseString "'\\\'"
[]    
ghci> parseString "\\\"   

<interactive>:50:18: error:
    lexical error in string/character literal at end of input
ghci> parseString "\\" 
[]
ghci> parseString "\\\\"
[]

As shown, I get the expected output when parsing '\\\\' but not when parsing just '\\' (as in the link referenced above), where I would have expected [(Const (StringVal "\"),"")] as the result. Is something wrong in my code, or is this due to ghci? If it is the latter, how can I work around it?

import Data.Char
import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))

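type ParseError = String

type Parser a = ReadP a

-- The AST types are not shown in the post; this minimal definition is
-- inferred from the GHCi output above.
data Value = StringVal String deriving Show
data Exp = Const Value deriving Show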

space :: Parser Char
space = satisfy isSpace

spaces :: Parser String 
spaces = many space


token :: Parser a -> Parser a
token combinator = spaces >> combinator


parseString input = readP_to_S (do 
                        e <- pExp
                        token eof
                        return e) input                 

pExp :: Parser Exp 
pExp = (do 
       pv <- stringConst
       return pv)

pStr :: Parser String
pStr = 
        (do 
        string "'"
        str <- many rightChar
        string "'"
        return str)

rightChar :: Parser Char
rightChar = (do 
                nextChar <- get
                case nextChar of 
                    '\\' -> (do ch <- (rightChar'); return ch)
                    _ -> return 'O' --nextChar
            )

rightChar' :: Parser Char 
rightChar' = (do 
                nextChar <- get
                case nextChar of
                    '\\' -> return nextChar 
                    'n' -> return '\n'
                    _ -> return 'N')

stringConst :: Parser Exp
stringConst =                           
             (do
                str <- pStr
                return (Const (StringVal str)))

Solution

  • You need to keep in mind that the internal representation of a string differs both from the characters that GHCi (or even just GHC) reads from a string literal in source code and from what GHCi prints when you show (or print) the string.

    The string literal "\\" in Haskell program text, when parsed and read by GHC, creates a string consisting of a single character, a backslash. When you print this string, it appears on the console as "\\", but it's still a string consisting of a single backslash character. When you say you expect the output at the GHCi prompt to include the string literal "\", that's nonsense. There is no such string. There is no internal representation of a string that, when displayed by GHCi, would result in the three characters ", \ and " being displayed on your screen, in much the same way there is no string that would be printed as "hello with no closing double quote.
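    The distinction between the literal, the string it denotes, and the text that show produces can be checked directly. The following is a small sketch; the names oneBackslash, backslashCount and shownForm are made up for illustration and do not appear in the question.

```haskell
-- The literal "\\" denotes a one-character string: a single backslash.
oneBackslash :: String
oneBackslash = "\\"

-- length counts the characters actually in the string,
-- not the characters typed in the literal.
backslashCount :: Int
backslashCount = length oneBackslash

-- show re-escapes the backslash and adds surrounding double quotes,
-- which is the form GHCi displays for a String result.
shownForm :: String
shownForm = show oneBackslash
```

    Here backslashCount is 1, while shownForm has four characters (quote, backslash, backslash, quote). In GHCi, putStrLn oneBackslash writes the raw single backslash, whereas print oneBackslash writes the escaped form "\\".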

    In your first test case:

    ghci> parseString "'\\\\'"
    

    you are supplying your parser with a four-character string: single quote, backslash, backslash, single quote. If this string had been read from a file, rather than typed in at the GHCi prompt, it would have been the literal four-character program text:

    '\\'
    

    Presumably, you want your parser to parse this as a single-character string consisting of a backslash. The output from your parse:

    [(Const (StringVal "\\"),"")]
    

    shows that your parser worked. The string as displayed on the screen "\\" represents a single-character string consisting of a backslash, which is what you wanted.

    For your next case:

    ghci> parseString "'\\'"
    

    you are supplying your parser with the three-character string:

    '\'
    

    Presumably, this is a parse error, as you appear to have escaped your closing single quote, meaning that this string is not terminated. Your parser correctly fails to parse it.
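    The two cases above can be reproduced with a condensed, self-contained variant of the question's parser. This is only a sketch under assumptions: the Value and Exp types are inferred from the GHCi output in the question, the names strChar, escaped, quoted and parseQuoted are hypothetical, and, unlike the original code, ordinary characters are returned unchanged and unknown escapes fail with pfail instead of yielding placeholder characters.

```haskell
import Text.ParserCombinators.ReadP

-- Assumed AST, inferred from the output shown in the question.
data Value = StringVal String deriving (Show, Eq)
data Exp = Const Value deriving (Show, Eq)

-- One character of a quoted string; a backslash starts an escape sequence.
strChar :: ReadP Char
strChar = do
  c <- get
  case c of
    '\\' -> escaped
    _    -> return c

-- The character following a backslash. Only \\ and \n are recognized;
-- any other escape (including \') makes this branch fail.
escaped :: ReadP Char
escaped = do
  c <- get
  case c of
    '\\' -> return '\\'
    'n'  -> return '\n'
    _    -> pfail

-- A string constant delimited by single quotes.
quoted :: ReadP Exp
quoted = do
  _ <- string "'"
  s <- many strChar
  _ <- string "'"
  return (Const (StringVal s))

-- Run the parser and require that the whole input is consumed.
parseQuoted :: String -> [(Exp, String)]
parseQuoted = readP_to_S (quoted <* eof)
```

    With this sketch, parseQuoted "'\\\\'" returns [(Const (StringVal "\\"),"")], a one-character result, and parseQuoted "'\\'" returns [], because the backslash consumes the closing quote as the start of an escape and no terminating quote remains.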

    For your third test case:

    ghci> parseString "'\\\'"
    

    you have passed the same three-character string to your parser:

    '\'
    

    The third backslash in your string literal is processed by GHCi as escaping the closing single quote. It is unnecessary but perfectly legal.
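    That the second and third literals denote the same string can be confirmed without involving the parser at all. A short sketch, with illustrative names (firstInput, secondInput, thirdInput, inputLengths, sameInput) that are not part of the question's code:

```haskell
-- The three quoted inputs from the question, written as Haskell literals.
firstInput, secondInput, thirdInput :: String
firstInput  = "'\\\\'"  -- four characters: quote, backslash, backslash, quote
secondInput = "'\\'"    -- three characters: quote, backslash, quote
thirdInput  = "'\\\'"   -- the same three characters; \' is a legal escape for a quote

-- Number of characters each literal actually denotes.
inputLengths :: [Int]
inputLengths = map length [firstInput, secondInput, thirdInput]

-- True when the second and third literals denote the same string.
sameInput :: Bool
sameInput = secondInput == thirdInput
```

    inputLengths evaluates to [4,3,3] and sameInput is True, so the parser sees identical input in the second and third test cases and fails on both for the same reason.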

    Your final test case:

    ghci> parseString "\\\"
    

    is syntactically invalid Haskell. The third backslash escapes the closing double quote, making it part of the string, and now your string is unterminated, as if you'd written:

    ghci> parseString "ab