I am storing document binaries (mostly PDF files) in a SQL Server database and am using the Acrobat IFilter and full-text indexing to make the contents of the files searchable.
However, some of these PDFs were scanned with really cheap software that did not do OCR, and are images of documents instead of proper documents with searchable text. I would like to determine which records in the database have no searchable text so they can be OCRed and re-uploaded.
I can get the document IDs that do have at least one full-text entry by using sys.dm_fts_index_keywords_By_Document
. I tried joining the distinct list of IDs with the document table to find the records that don't match, but this turned out to be incredibly slow -- I have about 20,000 documents (some hundreds of pages) and the query ran for over 20 minutes before I canceled it.
Is there a better way to do this?
I managed to come up with a solution that only took about 2 minutes to run on a set of 40,000 documents.
1) Create a temp table to store the document_id values from sys.dm_fts_index_keywords_by_document.
2) Populate it by grouping by document_id. Almost all documents will have at least some entries so choose a keyword count threshold that indicates the fulltext index has no meaningful information (I used 30 but most "bad" documents only had 3-5). In my particular case, the table storing the PDF binaries is PhysicalFile
.
3) If needed, join the temp table to whatever other tables have the information you need listed. In my particular case, MasterDocument
contains the document title and I also included a few lookup tables.
create table #PhysicalFileIDs (PhysicalFileID int, KeywordCount int)
insert into #PhysicalFileIDs (PhysicalFileID, KeywordCount)
select document_id, count(keyword) from sys.dm_fts_index_keywords_by_document (db_id(), object_id('PhysicalFile'))
group by document_id having count(keyword) < 30
select MasterDocument.DocumentID, MasterDocument.Title, ProfileType.ProfileTypeDisplayName, #PhysicalFileIDs.KeywordCount
from MasterDocument
inner join #PhysicalFileIDs on Masterdocument.PhysicalFileID = #PhysicalFileIDs.PhysicalFileID
inner join DocumentType on MasterDocument.DocumentTypeID = DocumentType.DocumentTypeID
inner join ProfileType on ProfileType.ProfileTypeID = DocumentType.ProfileTypeID
drop table #PhysicalFileIDs