Tags: parsing, scala, indentation, parser-combinators

Parsing an indentation-based language using Scala parser combinators


Is there a convenient way to use Scala's parser combinators to parse languages where indentation is significant? (e.g. Python)


Solution

  • Let's assume we have a very simple language in which this is a valid program:

    block
      inside
      the
      block
    

    and we want to parse this into a List[String] with each line inside the block as one String.

    We first define a method that takes a minimum indentation level and returns a parser for a single line indented by more than that level (that is, starting with at least minIndent + 1 whitespace characters).

    def line(minIndent:Int):Parser[String] = 
      repN(minIndent + 1,"\\s".r) ~ ".*".r ^^ {case s ~ r => s.mkString + r}
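
To make the "more than minIndent" rule concrete, here is a minimal sketch that tries the line parser on two inputs. The wrapper object LineDemo and its main method are hypothetical names used only for this illustration; the line definition is the one shown above.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object, used only for this illustration.
object LineDemo extends RegexParsers {
  // Whitespace is significant here, so it must not be skipped automatically.
  override def skipWhitespace = false

  // At least minIndent + 1 whitespace characters, then the rest of the line.
  def line(minIndent: Int): Parser[String] =
    repN(minIndent + 1, "\\s".r) ~ ".*".r ^^ { case s ~ r => s.mkString + r }

  def main(args: Array[String]): Unit = {
    // Four leading spaces exceed a minimum of two, so this parse succeeds.
    println(parse(line(2), "    inside").successful)
    // Two leading spaces do not exceed a minimum of two, so this parse fails.
    println(parse(line(2), "  inside").successful)
  }
}
```

Note that the parser keeps the leading whitespace in its result, because the matched indentation is concatenated back onto the rest of the line.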
    

    Then we define a block with a minimum indentation level by repeating the line parser with a suitable separator between lines.

    def lines(minIndent:Int):Parser[List[String]] =
      // Try "\r\n" first so a Windows line ending is consumed as one separator.
      rep1sep(line(minIndent), "\r\n|[\n\r]".r)
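
The following sketch shows how the repetition ends at the first line that is not indented enough, leaving the rest of the input unconsumed. LinesDemo is a hypothetical wrapper object, and the separator used here tries "\r\n" before a lone "\n" or "\r", which is a choice made for this sketch.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object, used only for this illustration.
object LinesDemo extends RegexParsers {
  override def skipWhitespace = false

  def line(minIndent: Int): Parser[String] =
    repN(minIndent + 1, "\\s".r) ~ ".*".r ^^ { case s ~ r => s.mkString + r }

  // One or more sufficiently indented lines, separated by line breaks.
  def lines(minIndent: Int): Parser[List[String]] =
    rep1sep(line(minIndent), "\r\n|[\n\r]".r)

  def main(args: Array[String]): Unit = {
    val input = "  a\n  b\nc"
    val result = parse(lines(0), input)
    // The two indented lines are collected; "c" is not part of the result.
    println(result.get)
    // Parsing stopped at offset 7, just before the line break that precedes "c".
    println(result.next.offset)
  }
}
```

Because rep1sep only consumes a separator when another line follows it successfully, the line break before the under-indented line stays in the remaining input.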
    

    Now we can define a parser for our little language like this:

    val block:Parser[List[String]] =
      (("\\s*".r <~ "block\\n".r) ^^ { _.size }) >> lines
    

    It first measures the indentation of the block header (the number of whitespace characters matched before block) and then, via >>, passes that number as the minimum to the lines parser. Let's test it:

    val s =
    """block
        inside
        the
        block
    outside
    the
    block"""
    
    println(block(new CharSequenceReader(s)))
    

    And we get the three lines inside the block; parsing stops at line 4, column 10, before the unindented lines:

    [4.10] parsed: List(    inside,     the,     block)
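
The same idea can be checked with a header that is itself indented. This is a self-contained sketch under a few assumptions: BlockDemo is a hypothetical wrapper object, the separator tries "\r\n" first, and the explicit lambda after >> is meant to be equivalent to passing the lines method directly.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical wrapper object, used only for this illustration.
object BlockDemo extends RegexParsers {
  override def skipWhitespace = false

  def line(minIndent: Int): Parser[String] =
    repN(minIndent + 1, "\\s".r) ~ ".*".r ^^ { case s ~ r => s.mkString + r }

  def lines(minIndent: Int): Parser[List[String]] =
    rep1sep(line(minIndent), "\r\n|[\n\r]".r)

  // Measure the header's indentation, then use >> to build the body parser
  // from that measured value.
  val block: Parser[List[String]] =
    (("\\s*".r <~ "block\\n".r) ^^ { _.size }) >> { indent => lines(indent) }

  def main(args: Array[String]): Unit = {
    // The header is indented by two spaces; "  c" sits at the header's level.
    val nested = "  block\n    a\n    b\n  c"
    // The result holds "    a" and "    b"; "  c" is not part of the block.
    println(parse(block, nested))
  }
}
```

The >> combinator is what makes the grammar context-sensitive here: the parser for the body is only constructed once the header's indentation is known.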
    

    For all of this to compile, you need these imports:

    import scala.util.parsing.combinator.RegexParsers
    import scala.util.parsing.input.CharSequenceReader
    

    You also need to put everything into an object that extends RegexParsers and disables automatic whitespace skipping, like so:

    object MyParsers extends RegexParsers {
      override def skipWhitespace = false
      ....