php regex mongodb search-engine case-insensitive

How do I make a Case Insensitive, Partial Text Search Engine that uses Regex with MongoDB and PHP?


I'm trying to improve the search bar in my application. Right now, if a user types "Titan" into the search bar, the application retrieves the movie "Titanic" from MongoDB using the following regex query:

require 'dbconnection.php';
if ($_SERVER["REQUEST_METHOD"] == "POST") {
    $input = $_REQUEST['input'];
    $query = $collection->find(['movie' => new MongoDB\BSON\Regex($input)]);
}

I can also make a collection case insensitive by creating it with a collation and then adding an index in the Mongo shell, so if a user types "tiTAnIc" into the search bar, the application will retrieve the movie "Titanic" from MongoDB:

db.createCollection("c1", { collation: { locale: 'en_US', strength: 2 } } )
db.c1.createIndex( { movie: 1 } )

I can't get these two features to work at the same time, however. The collated index above only removes case sensitivity when I change my query to an exact match like this:

$query=$collection->find( [ 'movie' => $input] );

If I use the regex query at the top in tandem with the collated index, it will ignore the regex part, so if I type "Titan," it doesn't retrieve anything; if I type "Titanic," however, it will successfully retrieve "Titanic" (because "Titanic" is the exact word stored in my database).

Any advice?


Solution

  • Beware: a regex search on an indexed field can hurt performance, as stated in the $regex docs:

    Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.

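    The underlying problem is that $regex ignores the collection's collation, so the regex itself has to be made case insensitive (with the i option). As for index use, MongoDB can only narrow an index scan when the pattern is a case-sensitive prefix expression (e.g. /^acme/):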

    For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a “prefix expression”, which means that all potential matches start with the same string. This allows MongoDB to construct a “range” from that prefix and only match against those values from the index that fall within that range.

    So the query needs to be changed like this, anchoring the pattern with ^ and passing the i option:

    $query=$collection->find(['movie' => new MongoDB\BSON\Regex('^'.$input, 'i')]);
    

    I suggest you design your collection more carefully.
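
The query in the answer passes $input straight into the pattern, so a title such as "E.T." would be interpreted as regex syntax. A minimal sketch of one way to guard against that, assuming a hypothetical helper named buildMovieSearch (not part of the original answer) and the same $collection from dbconnection.php:

```php
<?php
// Hypothetical helper (an assumption for illustration): builds the pattern
// and flags for a case-insensitive search on the movie title.
function buildMovieSearch(string $input, bool $prefixOnly = true): array
{
    // Escape regex metacharacters so input such as "E.T." or "(500)" is
    // matched literally instead of being interpreted as a pattern.
    $escaped = preg_quote(trim($input), '/');

    // '^' anchors the match to the start of the title; pass false to
    // match the text anywhere in the title instead.
    $pattern = ($prefixOnly ? '^' : '') . $escaped;

    // 'i' makes the regex itself case insensitive, since $regex does not
    // take the collection's collation into account.
    return ['pattern' => $pattern, 'flags' => 'i'];
}

$search = buildMovieSearch('tiTAn');

// Local sanity check with PHP's own regex engine; no database needed.
$matchesTitanic = preg_match(
    '/' . $search['pattern'] . '/' . $search['flags'],
    'Titanic'
) === 1;

// With the MongoDB extension loaded and $collection available:
// $query = $collection->find([
//     'movie' => new MongoDB\BSON\Regex($search['pattern'], $search['flags']),
// ]);
```

Because the pattern is case insensitive, this query still cannot use the index efficiently, per the docs quoted above. The escaping only makes the search safe for arbitrary input.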
