Search code examples
sql-serverfull-text-searchcase-sensitive

Full-text search on SQL Server has case-sensitive behavior


I have a full-text index with an empty stoplist.

The table contains a row with the value 'A. Beta'. If I search for 'A. Beta' or 'A. beta' (using full-text search) the row shows up correctly. But if I search for 'a. Beta' or 'a. beta' the row does not show up.

What is happening? Is there a good workaround to this? I could remove full stops from the search string before sending it to the database, but is there anything else I should remove?

Example:

CREATE TABLE mytable (
    id BIGINT IDENTITY NOT NULL, 
    name VARCHAR(256) NOT NULL,
    CONSTRAINT mytable_pk PRIMARY KEY (id)
    );

CREATE FULLTEXT STOPLIST empty_stoplist;
CREATE FULLTEXT CATALOG mytable_catalog;
CREATE FULLTEXT INDEX ON mytable (
        name Language 1033
    )
    KEY INDEX mytable_pk
    ON mytable_catalog;
ALTER FULLTEXT INDEX ON mytable SET STOPLIST empty_stoplist;
ALTER FULLTEXT INDEX ON mytable START UPDATE POPULATION;

INSERT INTO mytable (name) VALUES ('A. Beta');

-- Wait until indexing is complete

SELECT * FROM mytable mt WHERE CONTAINS(mt.*, '"A. Beta"');
-- 1 result
SELECT * FROM mytable mt WHERE CONTAINS(mt.*, '"A. beta"');
-- 1 result
SELECT * FROM mytable mt WHERE CONTAINS(mt.*, '"a. Beta"');
-- 0 results
SELECT * FROM mytable mt WHERE CONTAINS(mt.*, '"a. beta"');
-- 0 results

Running the fts parser on 'A. Beta' and 'a. Beta' gives different results. This might be related:

-- Replace 7 with the id of 'empty_stoplist' or use NULL instead (same result).
select * from sys.dm_fts_parser('"A. Beta"', 1033, 7, 0)
-- keyword             group_id    phrase_id   occurrence  special_term     display_term  expansion_type  source_term
-- 0x0061              1           0           1           Exact Match      a             0               A. Beta
-- 0x0062006500740061  1           0           2           Exact Match      beta          0               A. Beta

vs.

-- Replace 7 with the id of 'empty_stoplist' or use NULL instead (same result).
select * from sys.dm_fts_parser('"a. Beta"', 1033, 7, 0)
-- keyword             group_id    phrase_id   occurrence   special_term     display_term  expansion_type  source_term
-- 0x0061              1           0           1            Exact Match      a             0               a. Beta
-- 0xFF                1           0           9            End Of Sentence  END OF FILE   0               a. Beta
-- 0x0062006500740061  1           0           10           Exact Match      beta          0               a. Beta

Solution

  • The problem is actually not with case sensitivity but with the period in a. Beta which the full text parser is interpreting as "end of sentence". In other words, the parser correctly interprets A. (uppercase) as an abbreviation and ignores the period but it thinks a. (lowercase) is the end of a sentence instead of an abbreviation since abbreviations are typically capitalized in English. The full text engine uses end-of-sentence markers as well as end-of-paragraph markers when counting word occurrences to avoid false positives when matching quoted phrases. (I don't know much else about how occurrence counting works other than this vague statement from How Search Query Results Are Ranked: In order to avoid false positives in phrase and proximity queries, end-of-sentence and end-of-paragraph introduce larger occurrence gaps.)

    The easiest way to work around this problem is to remove periods from the search string. You will find that both of these searches will return the 1 result you are looking for.

    SELECT * FROM mytable mt WHERE CONTAINS(mt.*, '"A Beta"');
    
    SELECT * FROM mytable mt WHERE CONTAINS(mt.*, '"a Beta"');