Tags: c#, asp.net, sql-server, search-engine, full-text-indexing

Save a binary file in SQL Server as BLOB and text (or get the text from Full-Text index)


Currently we are saving files (PDF, DOC) into the database as BLOB fields. I would like to retrieve the raw text of each file so that I can manipulate it for hit-highlighting and other functions.

Does anyone know of a simple way to parse the files and store the raw text at save time, either via SQL or .NET code? I have found that Adobe has a filtdump utility that will convert a PDF to text. Filtdump seems to be a command-line tool, and I don't see a way to use it with a file stream. And what would the extractor be for Office documents and other file types?

-or-

Is there a way to pull out the raw text from the SQL Full text index, without using 3rd party filters?

Note: I am trying to build a .NET and MS SQL solution without having to use a third-party tool such as Lucene.


Solution

  • SQL Server's Full-Text Search feature uses IFilters to extract plain text from PDF and Office file formats. You can install the IFilters on your server, or, if your code runs on the same machine as SQL Server, you already have them.

    Here is an article which shows how to use IFilters from .NET: http://www.codeproject.com/KB/cs/IFilter.aspx
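
    On the file stream concern: extractors of this kind (IFilter wrappers, or command-line tools such as filtdump) generally take a file path, and IFilters are selected by file extension. A minimal sketch, assuming you already have the BLOB as a byte[], is to write it to a temp file with the original extension, run a path-based extractor, and delete the file afterwards. The names BlobTextExtractor, ExtractText, and RunCommandLineExtractor are hypothetical, as are the executable path and argument format you would pass in; no specific tool flags are assumed here.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class BlobTextExtractor
{
    // Writes the BLOB to a temp file that keeps the original extension
    // (e.g. ".pdf", ".doc"), hands the path to a caller-supplied extractor,
    // and always removes the temp file afterwards.
    public static string ExtractText(byte[] blob, string extension,
                                     Func<string, string> extractFromPath)
    {
        if (blob == null) throw new ArgumentNullException("blob");
        if (extractFromPath == null) throw new ArgumentNullException("extractFromPath");

        string tempPath = Path.Combine(
            Path.GetTempPath(),
            Guid.NewGuid().ToString("N") + extension);

        File.WriteAllBytes(tempPath, blob);
        try
        {
            return extractFromPath(tempPath);
        }
        finally
        {
            File.Delete(tempPath);
        }
    }

    // Runs a command-line extractor and returns whatever it writes to stdout.
    // exePath and argumentsFormat are assumptions supplied by the caller;
    // "{0}" in argumentsFormat is replaced with the quoted file path.
    public static string RunCommandLineExtractor(string exePath,
                                                 string argumentsFormat,
                                                 string filePath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = exePath,
            Arguments = string.Format(argumentsFormat, "\"" + filePath + "\""),
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output buffer cannot block the child.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }
}
```

    The extractor is passed in as a delegate, so the same ExtractText call works whether you plug in an IFilter-based reader like the one in the article or a command-line tool, for example `ExtractText(bytes, ".pdf", path => RunCommandLineExtractor(exePath, "{0}", path))` with your own exePath. You could then store the returned text in a separate column next to the BLOB at save time.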