Search code examples
regexrubytokenizesphinxsearch

Convert SphinxSearch query syntax to boolean search string in Ruby


I've been pondering over what is the easiest way to convert the following Sphinx Search query into what is commonly used in typical web searches or portals, e.g. a boolean search string, and vice versa

(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R

Needs to be converted to

(A OR B) AND "C D" AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

also slight variation for example purposes

(A | B) C D (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R

should be

(A OR B) AND C AND D AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R

For clarity, "A" can be any word and any case, its not case sensitive. Spaces denote AND in the starting syntax unless inside quotes. So AB would simply be one word e.g. Java. The space between (A|B) isnt important (A|B) is the same as ( A | B ) or (A | B) etc. Each letter denotes a word.

Some of these queries will be quite long - upto 500 terms. Although this isn't a huge overhead to process, I'm thinking what would be the BEST (most efficient) way to convert this. Tokenization, Regex/pattern matching, simple replace, recursion etc. What would any of you recommend?


Solution

  • Readers are perhaps looking for an elegant, at least not hackish, solution to this problem. That was my objective as well, but, alas, this is the best I've been able to come up with.

    Code

    def convert(str)
      subs = []
      str.gsub(/"[^"]*"| *\| */) do |s|
        if s.match?(/ *\| */)
          '|'
        else
          subs << s
          '*'
        end
      end.gsub(/ +/, ' AND ').
          gsub(/[*|]/) { |s| s == '|' ? ' OR ' : subs.shift }
    end
    

    Examples

    puts convert(%Q{(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R})
      #-> (A OR B) AND "C D" AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R
    
    puts convert(%Q{(A|B)   C D (E| "F G" |"H I J") ("K L"   ("M N" | "O P")) Q R})
      #-> (A OR B) AND C AND D AND (E OR "F G" OR "H I J") AND ("K L" AND ("M N" OR "O P")) AND Q AND R
    

    Notice that in this example there is no space before and/or after some pipes and in some places outside double-quoted strings there are multiple spaces.

    puts convert(%Q{(Ant | Bat) Cat Dog (Emu | "Frog Gorilla" | "Hen Ibex Jackel") ("Khawla Lynx" ("Magpie Newt" | "Ocelot Penguin")) Quail Rabbit})
      #-> (Ant OR Bat) AND Cat AND Dog AND (Emu OR "Frog Gorilla" OR "Hen Ibex Jackel") AND ("Khawla Lynx" AND ("Magpie Newt" OR "Ocelot Penguin")) AND Quail AND Rabbit
    

    Here I've replaced the capital letters with words.

    Explanation

    To see how this works let

    str = %Q{(A | B) "C D" (E | "F G" | "H I J") ("K L" ("M N" | "O P")) Q R}
      #=> "(A | B) \"C D\" (E | \"F G\" | \"H I J\") (\"K L\" (\"M N\" | \"O P\")) Q R"
    

    then

    subs = []
    str.gsub(/"[^"]*"| *\| */) do |s|
      if s.match?(/ *\| */)
        '|'
      else
        subs << s
        '*'
      end
    end
      #=> "(A|B) * (E|*|*) (* (*|*)) Q R"
    
      subs
        #=> ["\"C D\"", "\"F G\"", "\"H I J\"", "\"K L\"", "\"M N\"", "\"O P\""]
    

    As you see, I have removed the spaces around pipes and replaced all quoted strings with asterisks, saving those strings in the array subs, so that I can later replace the asterisks with their original values. The choice of an asterisk is of course arbitrary.

    The regular expression reads, "match a double-quoted string of zero or more characters or a pipe ('|') optionally preceded and/or followed by spaces".

    As a result of these substitutions, all remaining strings of spaces are to be replaced by ' AND ':

    s2 = s1.gsub(' +', ' AND ')
      #=> "(A|B) AND * AND (E|*|*) AND (* AND (*|*)) AND Q AND R"
    

    It remains to replace '|' with ' OR ' and each asterisk by its original value:

    s2.gsub(/[*|]/) { |s| s == '|' ? ' OR ' : subs.shift }
      #=> "(A OR B) AND \"C D\" AND (E OR \"F G\" OR \"H I J\") AND (\"K L\" AND (\"M N\" OR \"O P\")) AND Q AND R"