I'm trying to parse the following document:
val doc = """BEGIN
A Bunch
Of Text
With linebreaks
##
"""
The idea is that when I see ## on a line of its own, I should treat that as the end of parsing.
I've tried using the following code to parse this document:
object MyParser extends RegexParsers {
  val begin: Parser[String] = "BEGIN"
  val lines: Parser[Seq[String]] = repsep(line, eol)
  val line: Parser[String] = """.+""".r
  val eol: Parser[Any] = "\n" | "\r\n" | "\r"
  val end: Parser[String] = "##"
  val document: Parser[Seq[String]] =
    begin ~> lines <~ end
}
MyParser.parseAll(MyParser.document, doc)
However, when I try to execute this (in an Ammonite script), I get the following:
java.lang.NullPointerException
scala.util.parsing.combinator.Parsers$class.rep1sep(Parsers.scala:771)
ammonite.$file.vtt$minusparser$MyParser$.rep1sep(vtt-parser.sc:3)
scala.util.parsing.combinator.Parsers$class.repsep(Parsers.scala:687)
ammonite.$file.vtt$minusparser$MyParser$.repsep(vtt-parser.sc:3)
ammonite.$file.vtt$minusparser$MyParser$.<init>(vtt-parser.sc:5)
ammonite.$file.vtt$minusparser$MyParser$.<clinit>(vtt-parser.sc)
ammonite.$file.vtt$minusparser$.<init>(vtt-parser.sc:22)
ammonite.$file.vtt$minusparser$.<clinit>(vtt-parser.sc)
Can anyone see where I'm going wrong?
The reason for the error is that line and eol are defined as ordinary class field vals, but the lines parser uses them before they are defined. The code that assigns values to class fields is executed sequentially in the constructor, so line and eol are both still null when lines is being assigned. As the stack trace shows, repsep (through rep1sep) then calls a method on that null parser, which throws the NullPointerException.
To solve this, define line and eol as lazy vals or defs, or simply put them before lines in the code.
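The initialization-order effect can be reproduced without the parser library. The following is only a minimal sketch; the object and field names are made up for illustration and are not part of the original code. Depending on the compiler version, the first object may produce a warning about a reference to an uninitialized value.

```scala
// Hypothetical names; this only illustrates field initialization order.
object EagerOrder {
  // `second` is still null here, because vals are assigned top to bottom.
  val first: String = "first sees: " + second
  val second: String = "ready"
}

object LazyOrder {
  val first: String = "first sees: " + second
  // A lazy val is initialized on first access, so declaration order no longer matters.
  lazy val second: String = "ready"
}
```

Here EagerOrder.first ends up as "first sees: null", while LazyOrder.first is "first sees: ready". In the parser from the question the null is not concatenated into a string but has a method called on it, hence the exception.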
The parser itself also has some problems. By default, RegexParsers skips whitespace, including line breaks, before every string literal and regex parser. Since a regex such as .+ without any flags does not match line breaks, line naturally means "the rest of the current line, up to the line break", so you don't have to handle EOLs at all.
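The behaviour of the dot can be checked with the standard library alone. This is a small sketch rather than part of the parser; DotDemo is a made-up name, and findPrefixOf is used here because it matches at the start of the input, similar to how a regex parser is applied.

```scala
import scala.util.matching.Regex

// Hypothetical demo object, not part of the original parser.
object DotDemo {
  val line: Regex = """.+""".r
  // Without the DOTALL flag, `.` does not match line terminators,
  // so the prefix match stops at the end of the first line.
  val firstLine: Option[String] = line.findPrefixOf("A Bunch\nOf Text")
}
```

DotDemo.firstLine is Some("A Bunch"): the match ends right before the line break.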
Secondly, the lines parser as defined is greedy: it will happily consume everything, including the final ##. To make it stop before end you can, for example, use the not combinator, which succeeds without consuming any input only where its argument fails to match.
With all the changes, the parser looks like this:
import scala.util.parsing.combinator.RegexParsers

object MyParser extends RegexParsers {
  val begin: Parser[String] = "BEGIN"
  val line: Parser[String] = """.+""".r
  // not(end) fails where ## matches, so lines stops before the end marker
  val lines: Parser[Seq[String]] = rep(not(end) ~> line)
  val end: Parser[String] = "##"
  val document: Parser[Seq[String]] =
    begin ~> lines <~ end
}
You can also override the whitespace-skipping behaviour and handle all whitespace manually. This includes the whitespace after BEGIN and after the ##:
object MyParser extends RegexParsers {
  // whitespace is no longer skipped implicitly, so line breaks are matched explicitly
  override def skipWhitespace = false
  val eol: Parser[Any] = "\n" | "\r\n" | "\r"
  val begin: Parser[String] = "BEGIN" <~ eol
  val line: Parser[String] = """.*""".r
  val lines: Parser[Seq[String]] = rep(not(end) ~> line <~ eol)
  val end: Parser[String] = "##"
  // whiteSpace consumes the line break that follows ##
  val document: Parser[Seq[String]] =
    begin ~> lines <~ end <~ whiteSpace
}
Note that line is defined as .* instead of .+ here. This way the parser won't fail if there are any empty lines in the input.
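The difference between the two patterns on an empty line can be seen with plain regexes. This is only a sketch using the standard library; EmptyLineDemo and its field names are invented for the example.

```scala
// Hypothetical demo object, not part of the original parser.
object EmptyLineDemo {
  // The input starts with an empty line.
  val input: String = "\nnext line"
  // `.+` needs at least one character before the line break, so it does not match here.
  val plus: Option[String] = """.+""".r.findPrefixOf(input)
  // `.*` matches the empty string, so the empty line is accepted.
  val star: Option[String] = """.*""".r.findPrefixOf(input)
}
```

EmptyLineDemo.plus is None, whereas EmptyLineDemo.star is Some("").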