Search code examples
phpmysqlregexsearch-engine

how to break apart a search query


I'm developing a search engine for a CCG. I want the user to be able to find cards based on a query like, "blue brigade hero enhancements that can discard ec's" or "purple kings of israel". There are many variables to search for: brigades (purple, blue), types (heroes, evil characters [ec's]), special abilities (discard), and identifiers (kings of israel). I'm thinking about regexing to find common search parameters. I know this won't be easy, and it will take a long time to fine tune, but can someone point me in the right direction? Is regex even a recommend solution? I don't know if it's important, but I'm using php and mysql.


Solution

  • You'll have to write a parser to parse such query strings.

    Regular expressions will be useful to find 'verbs' and 'nouns' in query strings, but you probably will also need a non-context grammar describing the language of your queries, for example something like this:

    <QUERY> := <TARGET_SPEC>
    <TARGET_SPEC> := <OBJECT> 'that can' <ABILITY>
    <TARGET_SPEC> := <OBJECT>  
    <OBJECT> := <COLOR> <WHAT> 
    <OBJECT> := <WHAT>
    <COLOR> := 'blue' | 'red' | 'purple' | 'green' 
    <WHAT> := <ITEM> | <HERO>
    <ITEM> := <ADJECTIVE> <ITEM>
    <ADJECTIVE> := 'brigade' | 'hero' | 'magic' | 'enhanced' | 'rustproof'
    <ITEM> := 'enhancements' | 'sword' | 'potion'
    <HERO> := <HERO> 'of' <COUNTRY>
    <HERO> := 'kings' | 'knights' | 'thiefs'
    <COUNTRY> := 'israel' | 'palestine' | 'jordan' | 'egypt'
    <ABILITY> := <ABILITY> 'and' <ABILITY>
    <ABILITY> := 'swim' | 'dance' | discard <DISCARDABLE> | 'kill' <HERO> | 'use' <ITEM>
    <DISCARDABLE> := 'ec's' | 'et's' | 'etc'
    

    Parser built around such a grammar will be able to determine which part of your query is an object, which is an ability, color, country etc. For example, given input string 'red knights of jordan that can swim', parser will choose right rules and apply them:

    <QUERY> := 'red knights of jordan that can swim'
    <TARGET_SPEC> := 'red knights of jordan that can swim'
    <TARGET_SPEC> := 'red knights of jordan' 'that can' 'swim'
    <OBJECT> := 'red knights of jordan'
    <ABILITY> := 'swim'
    <COLOR> := 'red'
    <WHAT> := 'knights of jordan'
    <HERO> := 'knights' 'of' 'jordan'
    <HERO> := 'knights'
    <COUNTRY> := 'jordan'
    

    Basing on extracted information you'll be able to create search criteria.

    Using a grammar has an additional benefit of resolving some ambiguities that would be hard to resolve any other way - for example, if user asks for 'red kings that can kill white knights', simple algorithms that just look for color by matching every word with list of colors available will fail.

    I recommend reading a book on compiler design - Dragon Book is a classic choice (you don't have to read all of it, just the part about lexers and parsers).

    If you don't want to code the entire parser yourself (as this can be pretty time-consuming and error-prone), you'll need a parser generator (that is, a program that creates parser source code for given grammar); here is a question with some suggestions for PHP.

    You should also consider reading up on Natural Language Processing techniques. There is an online course from Stanford University here, I'm 'attending' it now and can wholeheartedly recommend it.