Search code examples
rustlalr

rust lalrpop lexing ambiguity: non-greedy matching inside of brackets


I'm trying to parse the SGF format, which has this BNF:

  Collection = GameTree { GameTree }
  GameTree   = "(" Sequence { GameTree } ")"
  Sequence   = Node { Node }
  Node       = ";" { Property }
  Property   = PropIdent PropValue { PropValue }
  PropIdent  = UcLetter { UcLetter }
  PropValue  = "[" CValueType "]"
  CValueType = (ValueType | Compose)
  ValueType  = (None | Number | Real | Double | Color | SimpleText |
        Text | Point  | Move | Stone)

I'm not trying to parse ValueType into distinct types using lalrpop. I just want the raw textual contents of the properties.

I'm having a problem with my Property rule. Specifically I have this line in a test file

;AB[dp];AB[pp]HA[6]

which is two nodes. The first node has a single Property and the second has two. The contents of the brackets need to be .* because anything can go in there. Arbitrary free text is a valid value for some properties.

Using lalrpop

PropValue = r"\[" <r".*"> r"\]";

fails as rule because it matches pp]HA[6 when of course it needs to match only pp.

Reasonably, (because I have no idea how this could have been implemented),

PropValue = r"\[" <r".*?"> r"\]";

also fails, with an excellent error message:

/mnt/c/Users/mason_000/wsl/dev/rust/seraph/gosgf/src/parse_sgf.lalrpop:18:5: 18:10 error: "non-greedy" repetitions (`*?` or `+?`) are not supported in regular expressions

Yet now I am in a bind, because I need non-greedy matches here.

One thing I could do is match everything that's not a close bracket. I'm not sure if that's the intended way to resolve this particular kind of ambiguity (first time using a lalr parser). I'm also not sure if ;HA[My free text ]]]] is a valid file that should have contents of My free text ]]]. But if it were a valid file, this workaround wouldn't work.

And also, that didn't seem to work:

PropValue = r"\[" <r"[^\]]"> r"\]";

Fails to parse and I can't decipher exactly where.

        thread 'core::sgf_replays::game189_has_6_handicap' panicked at 'called `Result::unwrap()` on an `Err` value: UnrecognizedToken { token: Some((7, Token(0, "9"), 8)), expected: ["r#\"\\\\]\"#"] }', libcore/result.rs:945:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:380
   3: std::panicking::default_hook
             at libstd/panicking.rs:390
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:576
   5: std::panicking::begin_panic
             at libstd/panicking.rs:537
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:521
   7: rust_begin_unwind
             at libstd/panicking.rs:497
   8: core::panicking::panic_fmt
             at libcore/panicking.rs:71
   9: core::result::unwrap_failed
             at /checkout/src/libcore/macros.rs:23
  10: <core::result::Result<T, E>>::unwrap
             at /checkout/src/libcore/result.rs:782
  11: seraph::core::sgf_replays::game189_has_6_handicap
             at src/core/mod.rs:612
  12: <F as alloc::boxed::FnBox<A>>::call_box
             at libtest/lib.rs:1453
             at /checkout/src/libcore/ops/function.rs:223
             at /checkout/src/liballoc/boxed.rs:788
  13: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102

And for completeness, here is the .lalrpop

use std::str::FromStr;
use gosgf::*;
use std::collections::HashMap;
grammar;

pub Collection: GoCollection = <GameTree*>;

match {
    r"\(",
     r"\)",
     r";" ,
     r"\[",
     r"\]",
    r"[A-Z]+",
}
else {
    r"[^\]]",
}
GameTree: GameTree = {
    r"\(" <sequence: Sequence> <children: GameTree*> r"\)" => {

        let komi = f64::from_str(sequence[0].properties.get("KM").unwrap_or(&"0.0".to_owned())).unwrap();
        let size = usize::from_str(sequence[0].properties.get("SZ").unwrap_or(&"19".to_owned())).unwrap();

        let handicap;
        {
            let mut handistr = String::from("0");
            for node in &sequence {
                if let Some(ha) = node.properties.get("HA") {
                    handistr = ha.to_string();
                    break;
                }
            }

            handicap =  usize::from_str(&handistr).unwrap();
        }

        GameTree {
            komi,
            size,
            handicap,
            sequence,
            children,
        }
    }
};

Sequence = <Node+>;

Node: Node  = {
    r";" <pairs: Property+> => {
        let mut properties : HashMap<String, String> = HashMap::new();
        for (k, v) in pairs {
            properties.insert(k, v);
        }
        Node { properties }
    } 
};

Property: (String, String) = {
    <k: PropIdent> <v: PropValue> => (k.to_string(), v.to_string())
};
PropIdent = <r"[A-Z]+">;
PropValue = r"\[" <r".*"> r"\]";

Solution

  • Here's one answer that I came up with. I dropped the multiple individual terminals and combined them all into one big terminal using the [^\]] trick.

    Property: (String, String) = {
        <r"[A-Z]+\[[^\]]*\]"> => {
            lazy_static! {
                static ref RE : regex::Regex = regex::Regex::new(r"([A-Z]+)\[([^\]]*)\]").unwrap();
            }
    
            let cap = RE.captures(<>).unwrap();
    
            let k = &cap[1];
            let v = &cap[2];
            (k.to_string(), v.to_string())
        }
    };
    

    Not going to accept this answer yet because I'm not sure that this is the most elegant solution given the available tools, and I'm also wondering if it's possible to create a rule that parses

    `;HA[My Text]]QB[[Nested]]`
    

    As two properties with ("HA", "My Text]") and ("QB", "[Nested]") as the outcomes, or if this kind of expression is impossible to parse with a lalr(1) parser.

    Edit: even though Stefan points out that valid SGF 4 files will have their close brackets escaped, I just want to know if this is possible. I'm beginning to doubt that it is, since once can never know whether a bracket is part of the text or the end of a property without looking past that property to the next thing.