Search code examples
perlsearchparsingcontext-free-grammardbix-class

Parsing a Gmail-style advanced search syntax?


I want to parse a search string similar to that provided by Gmail using Perl. An example input would be "tag:thing by:{user1 user2} {-tag:a by:user3}". I want to put it into a tree structure, such as

{and => [
    "tag:thing",
    {or => [
       "by:user1",
       "by:user2",
    ]},
    {or => [
       {not => "tag:a"},
       "by:user3",
    ]},
}

The general rules are:

  1. Tokens separated by space default to the AND operator.
  2. Tokens in braces are alternative options (OR). The braces can go before or after the field specifier. i.e. "by:{user1 user2}" and "{by:user1 by:user2}" are equivalent.
  3. Tokens prefixed with a hyphen are excluded.

These elements can also be combined and nested: e.g. "{by:user5 -{tag:k by:user3}} etc".

I'm thinking of writing a context-free grammar to represent these rules, and then parsing it into the tree. Is this unnecessary? (Is this possible using simple regexps?)

What modules are recommended for doing parsing context-free grammars?

(Eventually this will be used to generate an database query with DBIx::Class.)


Solution

  • Regex doesn't do nested things (like parenthesis) very well. By the time you get your regex counting parenthesis and capturing correctly, you could probably have a decent CFG parser. CFGs can logically guarantee correct parsing, while with a regex solution you're leaving a lot up to the magic. I can't recommend any Perl CFG libaries, but coding one sounds very cathartic.