Search code examples
parsinghaskellfunctional-programmingparsec

Using Parsec to parse a string containing a list of tuples


I am trying to parse (using parsec) a string that represents some data type that I defined. Thus the string needs to be parsed to my data type. An example of the string would be,

[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]

This would parse to the following,

[(Reg 1, [Cmp (Jlt, Intv (0, 0)), Op (Mod, Intv (-4,4))]), (Reg 2, [Cmp (Jge, (4,4))])]

Now this makes use of a few custom data types,

newtype Reg = Reg Int deriving (Eq, Show, Ord)
data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
    deriving (Eq, Show, Ord)
data BinAlu
  = Add
  | Sub
  | Mul
  | Div
  | Or
  | And
  | Lsh
  | Rsh
  | Mod
  | Xor
  | Mov
  | Arsh
  deriving (Eq, Show, Ord, Enum)
data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
  deriving (Eq, Show, Ord, Enum)
data Interval = Bot | Intv (Int, Int)
  deriving (Eq, Show, Ord)

Thus I want to parse the string to be the following type [(Reg, [LF])]

Now I am quite lost on how to actually do this. I think I have an idea, but I find it difficult implementing that idea.

My idea is to first use, between (symbol "[") (symbol "]"), which hopyfully would give me the contents between [ and ]. Then I need to do something similar for the paranthesis but repeat it. And then of course parsing the contents within the paranthesis.

I am basically looking for any advice on how to setup this parser. And how in general to structure such a parser.

Any help is greatly appreciated!


Solution

  • The following should get you started. We'll need some imports:

    module TupleParser where
    import Text.Parsec
    import Text.Parsec.Char
    import Text.Parsec.String
    

    In order to properly process whitespace, you should begin by writing some combinators to handle "lexemes", parsers that expect to start at a non-whitespace character, parse something, and discard trailing whitespace. While Parsec has some lexeme support in Text.Parsec.Token, it's over-designed and hard to use. Here's a simplified alternative, based on the Megaparsec approach:

    -- a lexeme starts on non-whitespace, parses something,
    -- and discards trailing whitespace
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces
    
    -- a symbol is a verbatim string, treated as a lexeme
    symbol :: String -> Parser String
    symbol s = lexeme (string s)
    

    The following are pretty standard lexemes for parsing numbers:

    -- an unsigned decimal number, treated as a lexeme
    decimal :: (Read n, Integral n) => Parser n
    decimal = lexeme (read <$> many1 digit)
    
    -- combinator for signed numbers; replace "string" with
    -- "symbol" if you want to allow space between dash and
    -- first digit
    signed :: (Num n) => Parser n -> Parser n
    signed p = option id (negate <$ string "-") <*> p
    

    And here are some other pretty standard lexemes/combinators:

    -- some standard names
    comma :: Parser String
    comma = symbol ","
    
    parens :: Parser p -> Parser p
    parens = between (symbol "(") (symbol ")")
    
    brackets :: Parser p -> Parser p
    brackets = between (symbol "[") (symbol "]")
    

    Here's a helper for lists, since you use it in a couple places.

    -- a list is a bracket-delimited, comma-separated list
    listOf :: Parser p -> Parser [p]
    listOf p = brackets (p `sepBy` comma)
    

    Now, we should define the lowest-level "atoms" of your grammar:

    -- (insert your data types here)
    
    reg :: Parser Reg
    reg = Reg <$> decimal
    
    lf :: Parser LF
    lf = parens
      $   Op <$> ((,) <$> binalu <* comma <*> interval)
      <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
      <|> Invalid <$ symbol "???"
    
    -- I don't really understand your interval syntax, so
    -- I'm just parsing any number "n" into "Intv (n,n)"
    interval :: Parser Interval
    interval = (\x -> Intv (x,x)) <$> signed decimal
    

    For binalu and jcmp, a simple first attempt might look like this:

    binalu :: Parser BinAlu
    binalu
      =   Mod <$ symbol "%"
      -- etc.
    
    jcmp :: Parser Jcmp
    jcmp
      =   Jlt <$ symbol "<"
      <|> Jge <$ symbol ">="
      -- etc.
    

    and this is sufficient to parse your example input. However, there's a problem here when you flesh these out with all your desired operators. The parser symbol "<" for example, will happily parse the first character of "<=", leaving the "=" to cause an error when you try to parse the comma next. If you order the alternatives to try "<=" first:

    jcmp :: Parser Jcmp
    jcmp
      =   Jle <$ symbol "<="
      <|> Jlt <$ symbol "<"
      -- etc.
    

    this still isn't sufficient, because symbol "<=" will happily start parsing a "<" not followed by a "=" and then "fail after consuming input", which prevents any later alternatives from being tried. You can use the try combinator to continue anyway:

    jcmp :: Parser Jcmp
    jcmp
      =   try (Jle <$ symbol "<=")
      <|> Jlt <$ symbol "<"
      -- etc.
    

    but this is tedious to get right. The usual solution is to define a list of "operator characters":

    -- include every character the appears in one of your operators
    opChars :: String
    opChars = "+-*/|&<=>%^!" 
    

    and define an operator combinator (note: Parsec calls this combinator reservedOp) that parses an operator followed by something other than an operator character:

    operator :: String -> Parser String
    operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))
    

    Now, you can list operators in any order, and they'll work fine:

    jcmp :: Parser Jcmp
    jcmp
      =   Jle <$ operator "<="
      <|> Jlt <$ operator "<"
      <|> Jgt <$ operator ">"
      <|> Jge <$ operator ">="
      -- etc.
    

    Finally, we can define the grammar for your higher-level structures. Note that the topmost parser should ignore leading whitespace, as all the lexeme parsers expect to start with non-whitespace, and check for end-of-input.

    type Program = [Statement]
    type Statement = (Reg, [LF])
    
    program :: Parser Program
    program = spaces *> listOf statement <* eof
    
    statement :: Parser Statement
    statement = parens $ (,) <$> reg <* comma <*> listOf lf
    

    Here's a test on your proposed input:

    main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"
    

    which should produce the output:

    [(Reg 1,[Cmp (Jlt,Intv (0,0)),Op (Mod,Intv (4,4))]),(Reg 2,[Cmp (Jge,Intv (4,4))])]
    

    Full code:

    module TupleParser where
    
    import Text.Parsec
    import Text.Parsec.Char
    import Text.Parsec.String
    
    lexeme :: Parser a -> Parser a
    lexeme p = p <* spaces
    
    symbol :: String -> Parser String
    symbol s = lexeme (string s)
    
     -- characters appearing in operators
    opChars :: String
    opChars = "+-*/|&<=>%^!"
    
    -- parse an operator
    operator :: String -> Parser String
    operator s = lexeme $ try (string s <* notFollowedBy (oneOf opChars))
    
    decimal :: (Read n, Integral n) => Parser n
    decimal = lexeme (read <$> many1 digit)
    
    signed :: (Num n) => Parser n -> Parser n
    signed p = option id (negate <$ string "-") <*> p
    
    comma :: Parser String
    comma = symbol ","
    
    parens :: Parser p -> Parser p
    parens = between (symbol "(") (symbol ")")
    
    brackets :: Parser p -> Parser p
    brackets = between (symbol "[") (symbol "]")
    
    listOf :: Parser p -> Parser [p]
    listOf p = brackets (p `sepBy` comma)
    
    newtype Reg = Reg Int deriving (Eq, Show, Ord)
    
    data LF = Op (BinAlu, Interval) | Cmp (Jcmp, Interval) | Invalid
        deriving (Eq, Show, Ord)
    
    data BinAlu
      = Add
      | Sub
      | Mul
      | Div
      | Or
      | And
      | Lsh
      | Rsh
      | Mod
      | Xor
      | Mov
      | Arsh
      deriving (Eq, Show, Ord, Enum)
    
    data Jcmp = Jeq | Jgt | Jge | Jlt | Jle | Jset | Jne | Jsgt | Jsge | Jslt | Jsle
      deriving (Eq, Show, Ord, Enum)
    
    data Interval = Bot | Intv (Int, Int)
      deriving (Eq, Show, Ord)
    
    reg :: Parser Reg
    reg = Reg <$> decimal
    
    lf :: Parser LF
    lf = parens
      $   Op <$> ((,) <$> binalu <* comma <*> interval)
      <|> Cmp <$> ((,) <$> jcmp <* comma <*> interval)
      <|> Invalid <$ symbol "???"
    
    binalu :: Parser BinAlu
    binalu
      =   Mod <$ operator "%"
      -- etc.
    
    jcmp :: Parser Jcmp
    jcmp
      =   Jlt <$ operator "<"
      <|> Jge <$ operator ">="
      -- etc.
    
    -- I don't really understand your interval syntax, so
    -- I'm just parsing any number "n" into "Intv (n,n)"
    interval :: Parser Interval
    interval = (\x -> Intv (x,x)) <$> signed decimal
    
    type Program = [Statement]
    type Statement = (Reg, [LF])
    
    program :: Parser Program
    program = spaces *> listOf statement <* eof
    
    statement :: Parser Statement
    statement = parens $ (,) <$> reg <* comma <*> listOf lf
    
    main = parseTest program "[(1,[(<,0),(%,4)]), (2,[(>=, 4)])]"