Tags: scala, parsing, lexical-analysis, scala-2.9

Combining parsers when lexing an SGMLish document in Scala


I'm new to lexing and parsing beyond small cases. With that caveat, my problem is that I'm trying to parse a JSP-like dialect in Scala. I'm lexing the character stream, and when I get to a JSP-like tag, I'm stuck:

    Some text<%tag attribute="value"%>more stuff.

My lexer is currently attempting to pull out the tag part and tokenize it, so I have something like:

    def document: Parser[Token] = tag | regular

    def tag: Parser[Token] = elem('<') ~ elem('%') ~ rep1(validTagName) ~ tagAttribute.* ~ elem('%') ~ elem('>') ^^ {
        case a ~ b ~ tagName ~ tagAttributes ~ c ~ d => {
            Tag(tagName.foldLeft("")(_+_)) :: tagAttributes.flatMap(_)
        }
    }

    def validTagName: Parser[Token] = elem("", Character.isLetter(_))  // over-simplified

    // ... other code for tagAttribute, and for Tag extends Token, here

You can probably spot half a dozen problems right now; I can spot a few myself, but this is where I'm currently at. Ultimately the token function is supposed to return a Parser, and if I understand all of this correctly, a Parser can be composed of other parsers. My thinking is that I should be able to construct a parser by combining several other Parser[Token] objects, but I don't know how to do that, and I don't fully understand whether that is the best approach.


Solution

  • It sounds like you may be mixing up your lexical and syntactic parsers. If you want to go the route of writing your own lexer, you'll need two parsers: the first extending lexical.Scanners (and therefore providing a token method of type Parser[Token]), and the second extending syntactical.TokenParsers and referring to the first in its implementation of that trait's abstract lexical member.
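
    For concreteness, here's a minimal sketch of that two-layer arrangement. The MyLexer and MyGrammar objects and the Word token are invented for illustration; the point is just the Scanners/TokenParsers wiring:

    import scala.util.parsing.combinator.lexical.Lexical
    import scala.util.parsing.combinator.syntactical.TokenParsers

    // Lexical extends Scanners, which leaves token and whitespace abstract.
    object MyLexer extends Lexical {
      case class Word(chars: String) extends Token

      // Collapse a run of letters into a single Word token.
      def token: Parser[Token] = rep1(letter) ^^ (cs => Word(cs.mkString))

      def whitespace: Parser[Any] = rep(whitespaceChar)
    }

    // The syntactic parser fills in TokenParsers' abstract lexical member
    // and then matches on the lexer's tokens instead of raw characters.
    object MyGrammar extends TokenParsers {
      type Tokens = MyLexer.type
      val lexical: MyLexer.type = MyLexer

      def word: Parser[String] = accept("word", { case MyLexer.Word(s) => s })
    }

    You'd then run it by wrapping the input in the lexer's Scanner, e.g. MyGrammar.word(new MyGrammar.lexical.Scanner("hello")).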

    Unless you have some specific reason to write your own lexer, though, it may be easier to use something like RegexParsers:

    import scala.util.parsing.combinator._

    object MyParser extends RegexParsers {
      def name = "\\p{Alpha}+".r              // a run of letters
      def value = "\"" ~> "[^\"]*".r <~ "\""  // quoted string, quotes dropped
      def attr = name ~ "=" ~ value ^^ { case k ~ _ ~ v => k -> v }

      // A tag is "<%", a name, zero or more attributes, then "%>".
      def tag: Parser[(String, Map[String, String])] =
        "<%" ~> name ~ rep(attr) <~ "%>" ^^ {
          case tagName ~ attrs => tagName -> attrs.toMap
        }
    }
    

    Now something like MyParser.parseAll(MyParser.tag, "<%tag attribute=\"value\"%>") works as expected, yielding ("tag", Map("attribute" -> "value")).
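
    To get at the parsed value itself, you can match on the ParseResult; this is the standard Parsers API rather than anything specific to this grammar:

    MyParser.parseAll(MyParser.tag, "<%tag attribute=\"value\"%>") match {
      case MyParser.Success(result, _) => println(result)  // (tag,Map(attribute -> value))
      case failure                     => println(failure) // error position and message
    }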

    Note that since we're not writing a lexer, there's no obligation to provide a Parser[Token] method.