java mongodb line-breaks case-insensitive

OR Query mongodb from java with "like" and "line break" and "case insensitive" at the same time

This is a sample document from my MongoDB collection page_link_titles:

{
    "_id" : ObjectId("553b11f30b81511d64152416"),
    "id" : 36470831,
    "linkTitles" : [ 
        "Syrian civil war", 
        "Damascus", 
        "Geographic coordinate system", 
        "Bashar al-Assad", 
        "Al Jazeera English", 
        "Free Syrian Army", 
        ...

        "February 2012 Aleppo bombings", 
        "2012 Deir ez-Zor bombing", 
        "Aleppo University bombings"
    ]
}

I want to find all documents whose linkTitles contain a phrase like '%term1%' or '%term2%' (and so on). term1 and term2 must have a line break on both sides. For example, take "Syrian civil war": if term1 = "war", I want this document to be returned by the query; however, if term1 = "yria", which is only part of a word in this document, it shouldn't be returned.

This is my java code:

for (String term : segment.terms) {
    DBObject clause1 = new BasicDBObject("linkTitles",
            java.util.regex.Pattern.compile("\\b"
                    + stprocess.singularize(term) + "\\b"));
    or.add(clause1);
}

DBObject mongoQuery = new BasicDBObject("$or", or);
DBCursor cursor = pageLinks.find(mongoQuery);

In the line java.util.regex.Pattern.compile("\\b" + stprocess.singularize(term) + "\\b"), I only handled the line break. I don't know how to write the regex so that it covers all my conditions: line break, case insensitive, like.

Any ideas?

Solution

It is possible to write a regular expression that achieves what you want. You can also use a single regular expression rather than using $or.

I'm using the shell for a quick example, searching for boxer or cat. First insert the test data:

db.test.drop()
db.test.insert([
{ "a" : "Boxer One" },
{ "a" : "A boxer dog" },
{ "a" : "A box shouldn't match" },
{ "a" : "should match BOXER" },
{ "a" : "wont match as this it the plural BOXERs" },
{ "a" : "also match on cat" }])

Using the following regular expression we can search for all our terms:

                                       
      /(^|\b)(boxer|cat)(\b|$)/i       
       +---+ +-------+  +---+         
          |       |        |           
          |       |        |           
Start or boundary |       Boundary or end
                  |                    
              Search terms

And do a find like so:

db.test.find({a: /(^|\b)(boxer|cat)(\b|$)/i})

That query will return the following results:

{ "_id" : ObjectId("555f18eee7b6d1b7e622de36"), "a" : "Boxer One" }
{ "_id" : ObjectId("555f18eee7b6d1b7e622de37"), "a" : "A boxer dog" }
{ "_id" : ObjectId("555f18eee7b6d1b7e622de39"), "a" : "should match BOXER" }
{ "_id" : ObjectId("555f18eee7b6d1b7e622de3b"), "a" : "also match on cat" }
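
To see the same matching behaviour without a database, here is a minimal sketch that runs the pattern over the six sample strings in plain Java. The class name WordBoundaryDemo and the method matching are made up for this illustration; it only approximates what the server does, since MongoDB evaluates the regex with its own engine rather than java.util.regex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Same pattern as the shell query, compiled case-insensitively.
    static final Pattern REGEX =
            Pattern.compile("(^|\\b)(boxer|cat)(\\b|$)", Pattern.CASE_INSENSITIVE);

    // Returns the values in which the pattern is found anywhere in the string.
    static List<String> matching(List<String> values) {
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            // find() searches for a match inside the value, like an unanchored query regex.
            if (REGEX.matcher(value).find()) {
                hits.add(value);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "Boxer One",
                "A boxer dog",
                "A box shouldn't match",
                "should match BOXER",
                "wont match as this it the plural BOXERs",
                "also match on cat");
        for (String hit : matching(values)) {
            System.out.println(hit);
        }
    }
}
```

Running main prints the same four values that the shell query returned; "A box shouldn't match" and the "BOXERs" line are skipped because no whole word boxer or cat occurs in them.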

In Java you might build this query up like so:

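// Join the singularized terms into one alternation: |term1|term2|...
StringBuilder singularizedTerms = new StringBuilder();
for (String term : terms) {
    singularizedTerms.append("|").append(stprocess.singularize(term));
}
// substring(1) drops the leading "|"; this assumes terms is not empty.
String regexPattern = String.format("(^|\\b)(%s)(\\b|$)", singularizedTerms.substring(1));
// CASE_INSENSITIVE corresponds to the /i flag in the shell example.
Pattern regex = Pattern.compile(regexPattern, Pattern.CASE_INSENSITIVE);
// One regex clause on the array field replaces the whole $or list.
DBObject mongoQuery = new BasicDBObject("linkTitles", regex);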
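
If the terms can contain regex metacharacters, or the list can be empty, the snippet above needs a little more care. The following is a rough sketch of one way to handle that; TermRegexBuilder and buildTermRegex are hypothetical names, not part of any library. It uses plain \b on both sides, because \b already matches at the start and end of a string when the adjacent character is a word character. It assumes every term begins and ends with a letter or digit, and that the server's regex engine accepts the \Q...\E quoting that Pattern.quote produces (PCRE does, but check this against your MongoDB version).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TermRegexBuilder {
    // Hypothetical helper: one case-insensitive, whole-word pattern for all terms.
    static Pattern buildTermRegex(List<String> terms) {
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("at least one term is required");
        }
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each term so characters such as '.' or '(' are matched literally.
            alternation.append(Pattern.quote(term));
        }
        // (?:...) groups the alternatives without creating a capture group.
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        Pattern regex = buildTermRegex(Arrays.asList("war", "bombing"));
        System.out.println(regex.matcher("Syrian civil war").find());
        System.out.println(regex.matcher("Aleppo University bombings").find());
    }
}
```

The resulting Pattern can be passed to new BasicDBObject("linkTitles", regex) in the same way as in the question's code.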

There are two problems with this approach.

It will be slow. A case-insensitive, unanchored regular expression can't make effective use of an index, so it will do a full scan of the collection; if you have 10 million documents, it will check each one!
It won't match plurals. For example, it won't match the document containing "BOXERs", because our regular expression explicitly doesn't allow partial matches!

Text indexes support this. A text search uses the index, which makes the operation faster, and it stems words, so it matches both plural and singular forms. It is also case insensitive by default. For example:

db.test.createIndex( { a: "text" } )
db.test.find({ $text: { $search: "boxer cat"}})

{ "_id" : ObjectId("555f18eee7b6d1b7e622de3b"), "a" : "also match on cat" }
{ "_id" : ObjectId("555f18eee7b6d1b7e622de3a"), "a" : "wont match as this it the plural BOXERs" }
{ "_id" : ObjectId("555f18eee7b6d1b7e622de36"), "a" : "Boxer One" }
{ "_id" : ObjectId("555f18eee7b6d1b7e622de37"), "a" : "A boxer dog" }
{ "_id" : ObjectId("555f18eee7b6d1b7e622de39"), "a" : "should match BOXER" }
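
From Java, the $search value is just the terms joined with spaces, which $text treats as a logical OR. Below is a small sketch of building that string; TextSearchString and toSearchString are hypothetical names, and the query document in the comment assumes the legacy driver classes from the question and an existing text index on the searched field.

```java
import java.util.Arrays;
import java.util.List;

public class TextSearchString {
    // Joins the terms with single spaces; $text ORs space-separated terms.
    // Note: a term starting with "-" would be read by $text as a negation.
    static String toSearchString(List<String> terms) {
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        String search = toSearchString(Arrays.asList("boxer", "cat"));
        System.out.println(search);
        // With the legacy driver classes used in the question, the query document would be:
        // new BasicDBObject("$text", new BasicDBObject("$search", search))
    }
}
```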