
How to do Java String matching using Boolean Search Syntax?


I'm looking for a Java/Scala library that takes a user query and a text and returns whether the text matches the query.

I'm processing a stream of information (e.g. the Twitter stream) and can't afford a batch process. I need to evaluate each tweet in real time, rather than indexing it in an in-memory Lucene index and querying it later.

It's possible to write a parser/lexer with ANTLR, but this is such a common use case that I can't believe nobody has written a library for it already.

Some samples from the TextQuery Ruby library, which does exactly what I need:

    TextQuery.new("'to be' OR NOT 'to_be'").match?("to be")   # => true

    TextQuery.new("-test").match?("some string of text")      # => true
    TextQuery.new("NOT test").match?("some string of text")   # => true

    TextQuery.new("a AND b").match?("b a")                    # => true
    TextQuery.new("a AND b").match?("a c")                    # => false

    q = TextQuery.new("a AND (b AND NOT (c OR d))")
    q.match?("d a b")                                         # => false
    q.match?("b")                                             # => false
    q.match?("a b cdefg")                                     # => true

    TextQuery.new("a~").match?("adf")                         # => true
    TextQuery.new("~a").match?("dfa")                         # => true
    TextQuery.new("~a~").match?("daf")                        # => true
    TextQuery.new("2~a~1").match?("edaf")                     # => true
    TextQuery.new("2~a~2").match?("edaf")                     # => false

    TextQuery.new("a", :ignorecase => true).match?("A b cD")  # => true

Since it is implemented in Ruby, it's not suitable for my platform, and I can't bring in JRuby just for this one part of our solution.

I found a similar question but couldn't get an answer from it: Boolean Query / Expression to a Concrete syntax tree

Thanks!


Solution

  • Given that you are doing text search, I would try to leverage some of the infrastructure provided by Lucene. Maybe you could create a QueryParser and call parse to get back a Query. Instantiable subclasses of Query are:

    TermQuery
    MultiTermQuery
    BooleanQuery
    WildcardQuery
    PhraseQuery
    PrefixQuery
    MultiPhraseQuery
    FuzzyQuery
    TermRangeQuery
    NumericRangeQuery
    SpanQuery
    

    Then you may be able to use pattern matching to implement what a match means for your application:

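    import org.apache.lucene.search.{BooleanClause, BooleanQuery, Query, TermQuery}
    import scala.collection.JavaConverters._

    def match_?(tweet: String, query: Query): Boolean = query match {
      case q: TermQuery => tweet.contains(q.getTerm.text)
      case q: BooleanQuery =>
        // evaluate each clause recursively, grouped by occurrence
        // (API names as in Lucene 3.x; newer versions may differ)
        def hits(occur: BooleanClause.Occur) =
          q.clauses.asScala.filter(_.getOccur == occur).map(c => match_?(tweet, c.getQuery))
        val must = hits(BooleanClause.Occur.MUST)
        val should = hits(BooleanClause.Occur.SHOULD)
        // all MUST clauses hold, no MUST_NOT clause holds, and if there
        // are no MUST clauses at least one SHOULD clause (if any) holds
        must.forall(identity) && !hits(BooleanClause.Occur.MUST_NOT).exists(identity) &&
          (must.nonEmpty || should.isEmpty || should.exists(identity))
      // you need to cover all subclasses above
      case _ => false
    }

    val q = queryParser.parse(userQuery)
    val res = match_?(tweet, q)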
    

    Here is an implementation. It surely has bugs, but you'll get the idea, and it works as a proof of concept. It reuses the syntax, documentation and grammar of the default Lucene QueryParser.
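
To make the recursive evaluation idea concrete without depending on Lucene, here is a minimal sketch in plain Java. It is not the TextQuery or Lucene grammar: the class name `BooleanMatcher` and the `Node` interface are made up for illustration, it only understands bare words, `AND`, `OR`, `NOT` and parentheses (no quoted phrases, `-` prefix or `~` wildcards), and it assumes whole-word matching on whitespace-separated tokens, as the TextQuery samples in the question suggest.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class BooleanMatcher {
    /** A query node decides whether a set of tokens satisfies it (hypothetical type). */
    public interface Node { boolean matches(Set<String> tokens); }

    private final Deque<String> input = new ArrayDeque<>();

    private BooleanMatcher(String query) {
        // Pad parentheses with spaces so that splitting on whitespace isolates them.
        String spaced = query.replace("(", " ( ").replace(")", " ) ").trim();
        input.addAll(Arrays.asList(spaced.split("\\s+")));
    }

    /** Parses queries such as "a AND (b AND NOT (c OR d))" into a Node tree. */
    public static Node parse(String query) {
        BooleanMatcher p = new BooleanMatcher(query);
        Node node = p.or();
        if (!p.input.isEmpty()) {
            throw new IllegalArgumentException("Unexpected token: " + p.input.peek());
        }
        return node;
    }

    // or := and ("OR" and)*
    private Node or() {
        Node left = and();
        while ("OR".equals(input.peek())) {
            input.poll();
            Node l = left, r = and();
            left = t -> l.matches(t) || r.matches(t);
        }
        return left;
    }

    // and := unary ("AND" unary)*
    private Node and() {
        Node left = unary();
        while ("AND".equals(input.peek())) {
            input.poll();
            Node l = left, r = unary();
            left = t -> l.matches(t) && r.matches(t);
        }
        return left;
    }

    // unary := "NOT" unary | "(" or ")" | WORD
    private Node unary() {
        String tok = input.poll();
        if (tok == null || tok.equals(")") || tok.equals("AND") || tok.equals("OR")) {
            throw new IllegalArgumentException("Unexpected token: " + tok);
        }
        if (tok.equals("NOT")) {
            Node inner = unary();
            return t -> !inner.matches(t);
        }
        if (tok.equals("(")) {
            Node inner = or();
            if (!")".equals(input.poll())) throw new IllegalArgumentException("Missing )");
            return inner;
        }
        return t -> t.contains(tok); // whole-word match against the token set
    }

    /** Splits the text on whitespace and evaluates the parsed query against the tokens. */
    public static boolean matches(String query, String text) {
        Set<String> tokens = new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
        return parse(query).matches(tokens);
    }

    public static void main(String[] args) {
        String q = "a AND (b AND NOT (c OR d))";
        System.out.println(matches(q, "d a b"));     // false
        System.out.println(matches(q, "b"));         // false
        System.out.println(matches(q, "a b cdefg")); // true
    }
}
```

For a stream, you would call `parse` once per user query and reuse the resulting `Node` for every incoming tweet, so only the tokenizing and the tree walk happen per message.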