
SQL Full Text Indexer, exact matches and escaping


I'm trying to replace a Keyword Analyser based Lucene.NET index with an SQL Server 2008 R2 based one.

I have a table that contains custom indexed fields that I need to query. The value of the Index column (see below) is a combination of name/value pairs of the custom index fields from a series of .NET types. The actual values are pulled from attributes at run time, because the structure is unknown.

I need to be able to search for specific name and value pairs, combined with ANDs and ORs, and return the rows that match the query.

Id          Index
====================================================================
1           [Descriptor.Type]=[5][Descriptor.Url]=[/]
2           [Descriptor.Type]=[23][Descriptor.Url]=[/test]
3           [Descriptor.Type]=[25][Descriptor.Alternative]=[hello]
4           [Descriptor.Type]=[26][Descriptor.Alternative]=[hello][Descriptor.FriendlyName]=[this is a test]

A simple query looks like this:

select * from Indices where contains ([Index], '[Descriptor.Url]=[/]');

That query results in the following error:

Msg 7630, Level 15, State 2, Line 1
Syntax error near '[' in the full-text search condition '[Descriptor.Url]=[/]'.

So with that in mind, I altered the data in the Index column to use | instead of [ and ]:

select * from Indices where contains ([Index], '|Descriptor.Url|=|/|');

That query is now valid, but when I run it, every row that contains Descriptor.Url with a value starting with / is returned, instead of only the row (exactly one in this case) that matches exactly.

My question is, how can I escape the query to account for the [ and ] and ensure that just the exact matching row is returned?

A more complex query looks a little like this:

select * from Indices where contains ([Index], '[Descriptor.Type]=[12] AND ([Descriptor.Url]=[/] OR [Descriptor.Url]=[/test])');

Thanks,
Kieron


Solution

  • Your main issue is the combination of the SQL Server word breaker and the CONTAINS syntax. By default, SQL Server word breakers eliminate punctuation and normalize numbers, dates, URLs, email addresses, and the like. They also lowercase everything and stem words.

    So, for your input string:

    [Descriptor.Type]=[5][Descriptor.Url]=[/]

    You would have the following tokens added to the index (along with their positions)

    descriptor type nn5 5 descriptor url

    (Note: nn5 is a normalized form that simplifies querying numbers and dates given in different formats; the original number is also indexed at the same position.)

    So, as you can see, the punctuation is not even stored in the full-text index, and thus there is no way to query it using the CONTAINS predicate.
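
    If you want to check the tokens yourself, one way is the sys.dm_fts_parser function, available from SQL Server 2008 onward. The following is a minimal sketch, assuming an English word breaker (LCID 1033), the system stoplist, and an account in the sysadmin role; the exact rows returned depend on the installed word breaker.

    ```sql
    -- Show how the word breaker tokenizes one stored value.
    -- Arguments: query string, LCID (1033 = English, an assumption here),
    -- stoplist id (0 = system stoplist), accent sensitivity (0 = insensitive).
    -- The double quotes make the whole value a single phrase.
    SELECT occurrence, special_term, display_term
    FROM sys.dm_fts_parser(
        N'"[Descriptor.Type]=[5][Descriptor.Url]=[/]"',
        1033,
        0,
        0)
    ORDER BY occurrence;
    ```

    The display_term column lists each token in the form used for matching, which makes it easy to confirm that the brackets, the equals sign, and the slash never appear as terms.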

    So your statement:

    select * from Indices where contains ([Index], '|Descriptor.Url|=|/|'); 
    

    Would actually be normalized down to "descriptor url" by the query generator before submitting it to the full text index, thus the hits on all the entries that have "descriptor" next to "url", excluding punctuation.

    What you need is the LIKE operator.
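
    As a sketch of what that could look like against the Indices table from the question: in a LIKE pattern, [ opens a character class, so it has to be escaped. The queries below use an ESCAPE clause with a backslash, which is an arbitrary choice; writing each literal [ as [[] works as well. Any % or _ in the searched values would need the same treatment.

    ```sql
    -- Simple case: rows whose Index contains the exact pair [Descriptor.Url]=[/]
    SELECT *
    FROM Indices
    WHERE [Index] LIKE '%\[Descriptor.Url\]=\[/\]%' ESCAPE '\';

    -- Combined case: one LIKE per name/value pair, joined with AND / OR
    SELECT *
    FROM Indices
    WHERE [Index] LIKE '%\[Descriptor.Type\]=\[12\]%' ESCAPE '\'
      AND (   [Index] LIKE '%\[Descriptor.Url\]=\[/\]%' ESCAPE '\'
           OR [Index] LIKE '%\[Descriptor.Url\]=\[/test\]%' ESCAPE '\');
    ```

    Because the closing bracket is part of each pattern, [/] no longer matches [/test]. Keep in mind that a pattern with a leading % cannot use an index seek, so each of these queries scans the column.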