Search code examples
sql-serversql-server-2012full-text-searchfulltext-index

Using Full-Text indexing to crawl binary blobs


If i store binary files (e.g. doc, html, xml, xps, docx, pdf) inside a varbinary(max) column in SQL Server, how can i use Full-Text indexing to crawl the binary files?

Imagine i create a table to store binary files:

CREATE TABLE Documents (
    DocumentID int IDENTITY,
    Filename nvarchar(32000),
    Data varbinary(max),
)

How can i leverage the IFilter system provided by Windows to crawl these binary files and extract useful, searchable, information?

The motivation for this, of course, is that Microsoft's Indexing Service is derprecated, and replaced with Windows Search. Indexing Service provided an OLEDB provider (MSIDX that SQL Server could use to query the Indexing Service catalog. The Indexing Service OLE DB Provider

Windows Search, on the other hand has no way to query the catalog. There is no way for SQL Server to access Windows Search.

Fortunately, the capabilities of Windows Search (and Indexing Service before it) were brought into SQL Server proper. The SQL Server Full-Text indexing service uses the same IFilter mechanism that has been around for 19 years.

The question is: how to use it to crawl blobs stored in the database.


Solution

  • SQL Server fulltext can index varbinary and image columns.

    You can see the list of all file types currently supported by SQL Server:

    SELECT * FROM sys.fulltext_document_types
    

    For example:

    | document_type | class_id                             | path                                                                             | version           | manufacturer          |
    |---------------|--------------------------------------|----------------------------------------------------------------------------------|-------------------|-----------------------|
    | .doc          | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll                                                  | 2008.0.9200.16384 | Microsoft Corporation |
    | .txt          | C7310720-AC80-11D1-8DF3-00C04FB6EF4F | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\msfte.dll   | 12.0.6828.0       | Microsoft Corporation |
    | .xls          | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll                                                  | 2008.0.9200.16384 | Microsoft Corporation |
    | .xml          | 41B9BE05-B3AF-460C-BF0B-2CDD44A093B1 | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\xmlfilt.dll | 12.0.9735.0       | Microsoft Corporation |
    

    When creating the varbinary (or image) column to contain your binary file, you must have another string column that gives the file type through its extension (e.g. ".doc")

    CREATE TABLE Documents (
        DocumentID int IDENTITY,
        Filename nvarchar(32000),
        Data varbinary(max),
        DataType varchar(50) --contains the file extension (e.g. ".docx", ".pdf")
    )
    

    When adding the binary column to the full-text index SQL Server needs you to tell it which column contains the data type string:

    ALTER FULLTEXT INDEX ON [dbo].[Documents] 
    ADD ([Data] TYPE COLUMN [DataType])
    

    You can test by importing a binary file from the filesystem on the server:

    INSERT INTO Documents(filename, DataType, data) 
    SELECT 
       'Managing Storage Spaces with PowerShell.doc' AS Filename, 
       '.doc', * 
    FROM OPENROWSET(BULK N'C:\Managing Storage Spaces with PowerShell.doc', SINGLE_BLOB) AS Data
    

    You can view the catalog status using:

    DECLARE @CatalogName varchar(50);
    SET @CatalogName = 'Scratch';
    
    SELECT
        CASE FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus')
        WHEN 0 THEN 'Idle'
        WHEN 1 THEN 'Full population in progress'
        WHEN 2 THEN 'Paused'
        WHEN 3 THEN 'Throttled'
        WHEN 4 THEN 'Recovering'
        WHEN 5 THEN 'Shutdown'
        WHEN 6 THEN 'Incremental population in progress'
        WHEN 7 THEN 'Building index'
        WHEN 8 THEN 'Disk is full. Paused.'
        WHEN 9 THEN 'Change tracking'
        ELSE 'Unknown'
        END+' ('+CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus') AS varchar(50))+')' AS PopulateStatus,
        FULLTEXTCATALOGPROPERTY(@CatalogName, 'ItemCount') AS ItemCount,
        CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'IndexSize') AS varchar(50))+ ' MiB' AS IndexSize,
        CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'UniqueKeyCount') AS varchar(50))+' words' AS UniqueKeyCount,
        FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAge') AS PopulateCompletionAge,
        DATEADD(second, FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAGe'), 0) AS PopulateCompletionDate
    

    And you can query the catalog:

    SELECT * FROM Documents
    WHERE FREETEXT(Data, 'Bruce')
    

    Additional IFilters

    SQL Server has a limited set of built-in filters. It can also use IFilter implementations registered on the system (e.g. Microsoft Office 2010 Filter Pack that provides docx, msg, one, pub, vsx, xlsx and zip support).

    You must enable the use of the OS-level filters by enabling the option:

    sp_fulltext_service 'load_os_resources', 1
    

    and restart the SQL Server service.

    load_os_resources int

    Indicates whether operating system word breakers, stemmers, and filters are registered and used with this instance of SQL Server. One of:

    0: Use only filters and word breakers specific to this instance of SQL Server.
    1: Load operating system filters and word breakers.

    By default, this property is disabled to prevent inadvertent behavior changes by updates made to the operating system. Enabling use of operating system resources provides access to resources for languages and document types registered with Microsoft Indexing Service that do not have an instance-specific resource installed. If you enable the loading of operating system resources, ensure that the operating system resources are trusted signed binaries; otherwise, they cannot be loaded when verify_signature is set to 1.

    If using SQL Server before SQL Server 2008, you must also restart the Full-Text indexing service after enabling this option:

    net stop msftesql
    net start msftesql
    

    Microsoft provides filter packs contain IFilter for the Office 2007 file types:

    And Adobe provides an IFilter for indexing PDFs (Foxit provides one, but theirs is not free):

    Bonus Reading