How to parse until a token is found on a line by itself


I'm trying to parse the following document:

val doc = """BEGIN
A Bunch
Of Text
With linebreaks
##
"""

The idea is that when I see a ## on a line of its own, I should treat that as the end of parsing.

I've tried using the following code to parse this document:

object MyParser extends RegexParsers {
    val begin: Parser[String] = "BEGIN"
    val lines: Parser[Seq[String]] = repsep(line, eol)
    val line: Parser[String] = """.+""".r
    val eol: Parser[Any] = "\n" | "\r\n" | "\r"
    val end: Parser[String] = "##"

    val document: Parser[Seq[String]] = 
      begin ~> lines <~ end 

}

MyParser.parseAll(MyParser.document, doc)

However, when I try to execute this (in an Ammonite script), I get the following:

java.lang.NullPointerException
  scala.util.parsing.combinator.Parsers$class.rep1sep(Parsers.scala:771)
  ammonite.$file.vtt$minusparser$MyParser$.rep1sep(vtt-parser.sc:3)
  scala.util.parsing.combinator.Parsers$class.repsep(Parsers.scala:687)
  ammonite.$file.vtt$minusparser$MyParser$.repsep(vtt-parser.sc:3)
  ammonite.$file.vtt$minusparser$MyParser$.<init>(vtt-parser.sc:5)
  ammonite.$file.vtt$minusparser$MyParser$.<clinit>(vtt-parser.sc)
  ammonite.$file.vtt$minusparser$.<init>(vtt-parser.sc:22)
  ammonite.$file.vtt$minusparser$.<clinit>(vtt-parser.sc)

Can anyone see where I'm going wrong?


Solution

  • The error occurs because line and eol are ordinary vals that lines uses before they are defined. Field initializers run sequentially in the constructor, top to bottom, so line and eol are both still null when lines is assigned. This is why the NullPointerException in the stack trace is thrown from the MyParser initializer (the <init> frame).

    To solve this, define line and eol as lazy vals or defs, or simply declare them before lines in the code.
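
    The initialization order described above can be reproduced without the parser library. The following is a minimal sketch in plain Scala; EagerFields, LazyFields, combined and suffix are hypothetical names used only for illustration.

    ```scala
    // Hypothetical names, used only to illustrate field initialization order.
    object EagerFields {
      // `suffix` is declared below, so it is still null when this line runs.
      val combined: String = "line" + suffix
      val suffix: String = "-eol"
    }

    object LazyFields {
      // Accessing a lazy val runs its initializer on demand,
      // so the forward reference sees the real value.
      val combined: String = "line" + suffix
      lazy val suffix: String = "-eol"
    }
    ```

    Here EagerFields.combined evaluates to "linenull", while LazyFields.combined evaluates to "line-eol". String concatenation tolerates the null; calling a method on it, as the parser combinators do, throws a NullPointerException instead.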


    The parser itself also has some problems. By default, RegexParsers skips whitespace, including line breaks, before every token. Since . in a regex without flags does not match line breaks, line (.+) already means "the rest of the line up to the line break", so you don't have to handle EOLs at all.

    Secondly, lines as defined is greedy: .+ matches the final ## as well, so lines consumes it and nothing is left for end. To make it stop before end, you can use the not combinator, for example.

    With all the changes, the parser looks like this:

    object MyParser extends RegexParsers {
      val begin: Parser[String] = "BEGIN"
      val line: Parser[String] = """.+""".r
      val lines: Parser[Seq[String]] =  rep(not(end) ~> line)
      val end: Parser[String] = "##"
    
      val document: Parser[Seq[String]] =
        begin ~> lines <~ end
    }
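
    To try this version out, something along the following lines should work. It is only a sketch: it assumes the scala-parser-combinators module is on the classpath, and SketchParser and parseDoc are hypothetical names that are not part of the original code.

    ```scala
    import scala.util.parsing.combinator.RegexParsers

    // Hypothetical wrapper around the parser shown above.
    object SketchParser extends RegexParsers {
      val begin: Parser[String] = "BEGIN"
      val line: Parser[String] = """.+""".r
      val lines: Parser[Seq[String]] = rep(not(end) ~> line)
      val end: Parser[String] = "##"

      val document: Parser[Seq[String]] =
        begin ~> lines <~ end

      // Converts the ParseResult into an Either for easier handling.
      def parseDoc(input: String): Either[String, Seq[String]] =
        parseAll(document, input) match {
          case Success(result, _) => Right(result)
          case failure: NoSuccess => Left(failure.msg)
        }
    }
    ```

    Calling SketchParser.parseDoc("BEGIN\nA Bunch\nOf Text\nWith linebreaks\n##\n") should return Right(List("A Bunch", "Of Text", "With linebreaks")). The forward reference to end inside lines is safe because rep, not and ~> take their arguments by name, so end is not read until parsing starts.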
    

    You can also turn off whitespace skipping and handle all whitespace yourself. This includes the whitespace after BEGIN and after the ##:

    object MyParser extends RegexParsers {
      override def skipWhitespace = false
    
      val eol: Parser[Any] = "\n" | "\r\n" | "\r"
      val begin: Parser[String] = "BEGIN" <~ eol
      val line: Parser[String] = """.*""".r
      val lines: Parser[Seq[String]] =  rep(not(end) ~> line <~ eol)
      val end: Parser[String] = "##"
    
      val document: Parser[Seq[String]] =
        begin ~> lines <~ end <~ whiteSpace
    
    }
    

    Note that line is defined as .* instead of .+ here, so the parser won't fail on empty lines in the input.
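
    One caveat: not(end) only checks that the line does not start with ##, so a line such as "##not the end" would also stop lines. If ## should count only when it is alone on its line, one possible variant is to give end a lookahead for the line break. This is a sketch under that assumption, not part of the original answer; StrictEndParser and parseDoc are hypothetical names, and scala-parser-combinators is assumed to be on the classpath.

    ```scala
    import scala.util.parsing.combinator.RegexParsers

    // Hypothetical variant of the manual-whitespace parser.
    object StrictEndParser extends RegexParsers {
      override def skipWhitespace = false

      val eol: Parser[Any] = "\n" | "\r\n" | "\r"
      val begin: Parser[String] = "BEGIN" <~ eol
      val line: Parser[String] = """.*""".r
      // Matches "##" only if a line break or the end of input follows immediately.
      val end: Parser[String] = """##(?=\r|\n|\z)""".r
      val lines: Parser[Seq[String]] = rep(not(end) ~> line <~ eol)

      // opt(eol) allows the input to end with or without a final line break.
      val document: Parser[Seq[String]] =
        begin ~> lines <~ end <~ opt(eol)

      def parseDoc(input: String): Either[String, Seq[String]] =
        parseAll(document, input) match {
          case Success(result, _) => Right(result)
          case failure: NoSuccess => Left(failure.msg)
        }
    }
    ```

    With this variant, StrictEndParser.parseDoc("BEGIN\nA Bunch\n##not the end\n\nOf Text\n##\n") should return Right(List("A Bunch", "##not the end", "", "Of Text")): the empty line is kept as an empty string, and only the bare ## ends the document. Trailing spaces after ## are not accepted by this sketch.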