Search code examples
sql-serverfull-text-indexing

Find records with an empty full text index


I am storing document binaries (mostly PDF files) in a SQL Server database and am using the Acrobat IFilter and full-text indexing to make the contents of the files searchable.

However, some of these PDFs were scanned with really cheap software that did not do OCR, and are images of documents instead of proper documents with searchable text. I would like to determine which records in the database have no searchable text so they can be OCRed and re-uploaded.

I can get the document IDs that do have at least one full-text entry by using sys.dm_fts_index_keywords_By_Document. I tried joining the distinct list of IDs with the document table to find the records that don't match, but this turned out to be incredibly slow -- I have about 20,000 documents (some hundreds of pages) and the query ran for over 20 minutes before I canceled it.

Is there a better way to do this?


Solution

  • I managed to come up with a solution that only took about 2 minutes to run on a set of 40,000 documents.

    1) Create a temp table to store the document_id values from sys.dm_fts_index_keywords_by_document.

    2) Populate it by grouping by document_id. Almost all documents will have at least some entries so choose a keyword count threshold that indicates the fulltext index has no meaningful information (I used 30 but most "bad" documents only had 3-5). In my particular case, the table storing the PDF binaries is PhysicalFile.

    3) If needed, join the temp table to whatever other tables have the information you need listed. In my particular case, MasterDocument contains the document title and I also included a few lookup tables.

    create table #PhysicalFileIDs (PhysicalFileID int, KeywordCount int)
    
    insert into #PhysicalFileIDs (PhysicalFileID, KeywordCount)
        select document_id, count(keyword) from sys.dm_fts_index_keywords_by_document (db_id(), object_id('PhysicalFile'))
        group by document_id having count(keyword) < 30
    
    select MasterDocument.DocumentID, MasterDocument.Title, ProfileType.ProfileTypeDisplayName, #PhysicalFileIDs.KeywordCount
        from MasterDocument
        inner join #PhysicalFileIDs on Masterdocument.PhysicalFileID = #PhysicalFileIDs.PhysicalFileID
        inner join DocumentType on MasterDocument.DocumentTypeID = DocumentType.DocumentTypeID
        inner join ProfileType on ProfileType.ProfileTypeID = DocumentType.ProfileTypeID
    
    drop table #PhysicalFileIDs