I want to use the TIFF IFilter built into Windows Server 2008 R2 with Full-Text Search in SQL Server 2008 R2.
I have installed the filter through Server Manager and set the "Force TIFF IFilter to perform OCR for every page in a TIFF document" Local Group Policy setting (Computer Configuration -> Administrative Templates -> OCR) to "Enabled."
I have also created a full-text catalog and a table called "FileData" that looks like this:
CREATE TABLE [FileServer].[FileData](
    [FileDataId] [int] IDENTITY(1,1) NOT NULL,
    [FileGUID] [uniqueidentifier] ROWGUIDCOL NOT NULL,
    [Data] [varbinary](max) FILESTREAM NOT NULL,
    [Extension] [nvarchar](100) NULL,
    [Filename] [nvarchar](256) NULL,
    [Path] [nvarchar](256) NULL,
    CONSTRAINT [PK_FileData_FileDataId] PRIMARY KEY CLUSTERED
    (
        [FileDataId] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] FILESTREAM_ON [FILES],
    CONSTRAINT [UX_File_FileGUID] UNIQUE NONCLUSTERED
    (
        [FileGUID] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] FILESTREAM_ON [FILES]
GO
SET ANSI_PADDING OFF
GO
ALTER TABLE [FileServer].[FileData] ADD CONSTRAINT [DF_FileData_FileGUID] DEFAULT (newid()) FOR [FileGUID]
GO
ALTER TABLE [FileServer].[FileData] ADD CONSTRAINT [DF_FileData_FileData] DEFAULT (0x) FOR [Data]
GO
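The full-text index definition itself is not shown in this post. For anyone reproducing the setup, here is a minimal sketch of what it might look like. The catalog name FileCatalog is hypothetical, and indexing both [Data] and [Extension] is an assumption based on the queries further down. I believe the load_os_resources option has to be on for SQL Server to load IFilters registered with the operating system, and that the service needs a restart after changing it, but treat both points as assumptions to verify.

```sql
-- Let full-text search load filters registered with the operating system
-- (assumed to be required for the Windows TIFF IFilter).
EXEC sp_fulltext_service 'load_os_resources', 1;
GO

-- Check whether a filter is registered for the TIFF extensions.
SELECT document_type, path, manufacturer
FROM sys.fulltext_document_types
WHERE document_type IN ('.tif', '.tiff');
GO

-- [FileCatalog] is a hypothetical catalog name; substitute your own.
-- [Extension] is the type column that tells full-text which filter to use,
-- and is also indexed so it can be searched on its own.
CREATE FULLTEXT INDEX ON [FileServer].[FileData]
(
    [Data] TYPE COLUMN [Extension],
    [Extension]
)
KEY INDEX [PK_FileData_FileDataId]
ON [FileCatalog]
WITH CHANGE_TRACKING AUTO;
GO
```

If the SELECT returns no rows for .tif or .tiff, the filter is not visible to SQL Server and no OCR will happen regardless of the image.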
When I insert a file into that table, such as a PDF or Word DOC, I can hit keywords in the file moments later with a full-text search.
I also made a big TIFF file with very clear text (1024 x 768, about 12 words) and imported THAT into the FileData table. I can find every word in it with a query like this:
SELECT [Path], [Filename], [Data]
FROM [FileServer].[FileData]
WHERE FREETEXT(*, 'Jason') and FREETEXT(Extension, 'tif');
However, when I use a "real" TIFF file, like a datasheet from a manufacturer, I get ZERO results when searching for keywords. I do not have a clue as to why, and there is not much online about troubleshooting this with SQL Server.
I have tried saving the .TIFF file with various kinds of compression, without compression, etc., and I am just not having any luck. The text in my test file is CRYSTAL clear and still pretty large. I cannot imagine that the file clarity is the problem, although I suppose that is possible.
Just so you would have something to compare, I took the following two images and imported them:
WORKING SAMPLE FILE BROKEN SAMPLE FILE
The results for the working sample are REALLY good. These are the keywords from the working sample in the full-text index: $3.50 © 0004 08 1989 2010 21 21:35:42 235 282 3116 3702 40 48109 89 abounds absorb abstract accompanied acquired act action advantages agency algorithm algorithms already amounts amsterdam analyze ann appeared applications arbor arnficioj artficia1 assignment b.v. based basis booker brigade bucket building bv capabilities carefully changing characteristics checkers classifier classtfier closing cognitive comparing competing complex complexities complexity computer confronting confuse consider continual continually continuously contrived credit cures d.e. data de decent defined definition design designed devising discovery discussion disturbing during ecological economic eecs effort elsevier END OF FILE engineering environment environments err even events example exhibit experience expressed extant extensions face faces feasible file firing first flow following format game generates generic genetic giving goals goldberg good holiadd holland however hypotheses image immersed immune impinging implicitly inexactly information intelligence interest intervene introduction irrelevant j.h. jh journal l.b. large lb learn learning lifespan long machine mammal mammalian mammal's massively message mi michigan new nn0004 nn08 nn1989 nn2010 nn21 nn235 nn282 nn3116 nn3702 nn3d5$ nn40 nn48109 nn89 noisy north nos novel novelty obtainable often one operate option originally outside own paper parallel passing pattern payoff permission perpetual perpetually play player plays possible pretty problems provide publisher publishers quickly randomly rarely real realistic reinforcement repeatedly reprinted requirements retina reviews revise robotic rule rules science sequences sets significantly simple simply small sparse system systems tagged techniques theory thor tiff time tt2135 twice twists two typically u.s.a. university upon us usa visual vol without wonder world
But the results from the Broken Sample are just... well, vacant. Not a single word from the actual TIFF image: 08 2010 21 21:49:22 END OF FILE file format image nn08 nn2010 nn21 tagged tiff tt2149
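For anyone who wants to pull the same kind of keyword list for their own files, here is a sketch of one way to do it with the sys.dm_fts_index_keywords_by_document function. The FileDataId value of 2 is a hypothetical placeholder for whichever row you want to inspect.

```sql
-- List the terms the full-text index holds for a single row.
-- document_id corresponds to the full-text key, here FileDataId.
SELECT kbd.document_id,
       kbd.display_term,
       kbd.occurrence_count
FROM sys.dm_fts_index_keywords_by_document(
         DB_ID(), OBJECT_ID(N'FileServer.FileData')) AS kbd
WHERE kbd.document_id = 2   -- hypothetical FileDataId of the TIFF row
ORDER BY kbd.display_term;
```

A row that only returns terms like the date, "tagged image file format" and END OF FILE means the filter opened the file but the OCR produced no text from the image itself.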
If anybody has any ideas on what to try next, I'm ALL ears.
Well, it turns out the actual problem was the SIZE of the image. The OCR in the TIFF IFilter wasn't even attempting to process it because it was too big. I had to discover this by trial and error, and could not find any documentation stating the maximum size/DPI of the incoming TIFF. Does anybody know these specs? This article appears to have some information: support.microsoft.com/kb/837847 But it is specific to SharePoint, and I have not had time to mess with the settings to see if they work here. Also, I'd really need to just remove the size cap. Any ideas there?
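Since the limit had to be found by trial and error, a query along these lines might help narrow it down by listing each TIFF row with its stored size next to the number of terms indexed from it. This is only a sketch: byte size is a rough proxy for pixel dimensions (compression changes it a lot), and the LIKE pattern on [Extension] is an assumption about how the extensions are stored.

```sql
-- Compare stored size against the number of indexed terms per TIFF row.
-- Large files with a very low term count are candidates for
-- "OCR was skipped".
SELECT fd.[FileDataId],
       fd.[Filename],
       MAX(DATALENGTH(fd.[Data])) AS SizeInBytes,
       COUNT(kbd.display_term)    AS IndexedTerms
FROM [FileServer].[FileData] AS fd
LEFT JOIN sys.dm_fts_index_keywords_by_document(
              DB_ID(), OBJECT_ID(N'FileServer.FileData')) AS kbd
       ON kbd.document_id = fd.[FileDataId]
      -- Count only terms from the [Data] column, not from [Extension].
      AND kbd.column_id = COLUMNPROPERTY(
              OBJECT_ID(N'FileServer.FileData'), 'Data', 'ColumnId')
WHERE fd.[Extension] LIKE N'%tif%'   -- assumed to match .tif and .tiff
GROUP BY fd.[FileDataId], fd.[Filename]
ORDER BY SizeInBytes DESC;
```

Importing the same page saved at several resolutions and rerunning this should show roughly where the term count drops off.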