I'm trying to parse the SGF format, which has this BNF:
Collection = GameTree { GameTree }
GameTree = "(" Sequence { GameTree } ")"
Sequence = Node { Node }
Node = ";" { Property }
Property = PropIdent PropValue { PropValue }
PropIdent = UcLetter { UcLetter }
PropValue = "[" CValueType "]"
CValueType = (ValueType | Compose)
ValueType = (None | Number | Real | Double | Color | SimpleText |
Text | Point | Move | Stone)
I'm not trying to parse ValueType
into distinct types using lalrpop. I just want the raw textual contents of the properties.
I'm having a problem with my Property
rule. Specifically I have this line in a test file
;AB[dp];AB[pp]HA[6]
which is two nodes. The first node has a single Property
and the second has two. The contents of the brackets need to be .*
because anything can go in there. Arbitrary free text is a valid value for some properties.
Using lalrpop
PropValue = r"\[" <r".*"> r"\]";
fails as rule because it matches pp]HA[6
when of course it needs to match only pp
.
Reasonably, (because I have no idea how this could have been implemented),
PropValue = r"\[" <r".*?"> r"\]";
also fails, with an excellent error message:
/mnt/c/Users/mason_000/wsl/dev/rust/seraph/gosgf/src/parse_sgf.lalrpop:18:5: 18:10 error: "non-greedy" repetitions (`*?` or `+?`) are not supported in regular expressions
Yet now I am in a bind, because I need non-greedy matches here.
One thing I could do is match everything that's not a close bracket. I'm not sure if that's the intended way to resolve this particular kind of ambiguity (first time using a lalr parser). I'm also not sure if ;HA[My free text ]]]]
is a valid file that should have contents of My free text ]]]
. But if it were a valid file, this workaround wouldn't work.
And also, that didn't seem to work:
PropValue = r"\[" <r"[^\]]"> r"\]";
Fails to parse and I can't decipher exactly where.
thread 'core::sgf_replays::game189_has_6_handicap' panicked at 'called `Result::unwrap()` on an `Err` value: UnrecognizedToken { token: Some((7, Token(0, "9"), 8)), expected: ["r#\"\\\\]\"#"] }', libcore/result.rs:945:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
1: std::sys_common::backtrace::print
at libstd/sys_common/backtrace.rs:71
at libstd/sys_common/backtrace.rs:59
2: std::panicking::default_hook::{{closure}}
at libstd/panicking.rs:380
3: std::panicking::default_hook
at libstd/panicking.rs:390
4: std::panicking::rust_panic_with_hook
at libstd/panicking.rs:576
5: std::panicking::begin_panic
at libstd/panicking.rs:537
6: std::panicking::begin_panic_fmt
at libstd/panicking.rs:521
7: rust_begin_unwind
at libstd/panicking.rs:497
8: core::panicking::panic_fmt
at libcore/panicking.rs:71
9: core::result::unwrap_failed
at /checkout/src/libcore/macros.rs:23
10: <core::result::Result<T, E>>::unwrap
at /checkout/src/libcore/result.rs:782
11: seraph::core::sgf_replays::game189_has_6_handicap
at src/core/mod.rs:612
12: <F as alloc::boxed::FnBox<A>>::call_box
at libtest/lib.rs:1453
at /checkout/src/libcore/ops/function.rs:223
at /checkout/src/liballoc/boxed.rs:788
13: __rust_maybe_catch_panic
at libpanic_unwind/lib.rs:102
And for completeness, here is the .lalrpop
use std::str::FromStr;
use gosgf::*;
use std::collections::HashMap;
grammar;
pub Collection: GoCollection = <GameTree*>;
match {
r"\(",
r"\)",
r";" ,
r"\[",
r"\]",
r"[A-Z]+",
}
else {
r"[^\]]",
}
GameTree: GameTree = {
r"\(" <sequence: Sequence> <children: GameTree*> r"\)" => {
let komi = f64::from_str(sequence[0].properties.get("KM").unwrap_or(&"0.0".to_owned())).unwrap();
let size = usize::from_str(sequence[0].properties.get("SZ").unwrap_or(&"19".to_owned())).unwrap();
let handicap;
{
let mut handistr = String::from("0");
for node in &sequence {
if let Some(ha) = node.properties.get("HA") {
handistr = ha.to_string();
break;
}
}
handicap = usize::from_str(&handistr).unwrap();
}
GameTree {
komi,
size,
handicap,
sequence,
children,
}
}
};
Sequence = <Node+>;
Node: Node = {
r";" <pairs: Property+> => {
let mut properties : HashMap<String, String> = HashMap::new();
for (k, v) in pairs {
properties.insert(k, v);
}
Node { properties }
}
};
Property: (String, String) = {
<k: PropIdent> <v: PropValue> => (k.to_string(), v.to_string())
};
PropIdent = <r"[A-Z]+">;
PropValue = r"\[" <r".*"> r"\]";
Here's one answer that I came up with. I dropped the multiple individual terminals and combined them all into one big terminal using the [^\]]
trick.
Property: (String, String) = {
<r"[A-Z]+\[[^\]]*\]"> => {
lazy_static! {
static ref RE : regex::Regex = regex::Regex::new(r"([A-Z]+)\[([^\]]*)\]").unwrap();
}
let cap = RE.captures(<>).unwrap();
let k = &cap[1];
let v = &cap[2];
(k.to_string(), v.to_string())
}
};
Not going to accept this answer yet because I'm not sure that this is the most elegant solution given the available tools, and I'm also wondering if it's possible to create a rule that parses
`;HA[My Text]]QB[[Nested]]`
As two properties with
("HA", "My Text]")
and ("QB", "[Nested]")
as the outcomes, or if this kind of expression is impossible to parse with a lalr(1) parser.
Edit: even though Stefan points out that valid SGF 4 files will have their close brackets escaped, I just want to know if this is possible. I'm beginning to doubt that it is, since once can never know whether a bracket is part of the text or the end of a property without looking past that property to the next thing.