Search code examples
sqlsql-serverindexingwildcardsql-like

Optimising LIKE expressions that start with wildcards


I have a table in a SQL Server database with an address field (ex. 1 Farnham Road, Guildford, Surrey, GU2XFF) which I want to search with a wildcard before and after the search string.

SELECT *
FROM Table
WHERE Address_Field LIKE '%nham%'

I have around 2 million records in this table and I'm finding that queries take anywhere from 5-10s, which isn't ideal. I believe this is because of the preceding wildcard.

I think I'm right in saying that any indexes won't be used for seek operations because of the preceeding wildcard.

Using full text searching and CONTAINS isn't possible because I want to search for the latter parts of words (I know that you could replace the search string for Guil* in the below query and this would return results). Certainly running the following returns no results

SELECT *
FROM Table
WHERE CONTAINS(Address_Field, '"nham"')

Is there any way to optimise queries with preceding wildcards?


Solution

  • Here is one (not really recommended) solution.

    Create a table AddressSubstrings. This table would have multiple rows per address and the primary key of table.

    When you insert an address into table, insert substrings starting from each position. So, if you want to insert 'abcd', then you would insert:

    • abcd
    • bcd
    • cd
    • d

    along with the unique id of the row in Table. (This can all be done using a trigger.)

    Create an index on AddressSubstrings(AddressSubstring).

    Then you can phrase your query as:

    SELECT *
    FROM Table t JOIN
         AddressSubstrings ads
         ON t.table_id = ads.table_id
    WHERE ads.AddressSubstring LIKE 'nham%';
    

    Now there will be a matching row starting with nham. So, like should make use of an index (and a full text index also works).

    If you are interesting in the right way to handle this problem, a reasonable place to start is the Postgres documentation. This uses a method similar to the above, but using n-grams. The only problem with n-grams for your particular problem is that they require re-writing the comparison as well as changing the storing.