sql-server sql-server-2000 full-text-search

MS-SQL Server 2000 slow full text indexing

We have a full text index on a fairly large table of 633,569 records. The index is rebuilt from scratch as part of a maintenance plan every evening, after a bunch of DTS packages run that delete / insert records. Large chunks of data are deleted, then inserted (to take care of updates and inserts), so incremental indexing is not a possibility. Changing the packages to only delete when necessary is not a possibility either as it is a legacy application that will eventually be replaced.

The FTI includes two columns: one is a varchar(50) not null and the other is a varchar(255) null.

There is a clustered index on the primary key column, which is just an identity column. There is also a combined index on an integer column and the varchar(50) column mentioned above. This latter index was added for performance reasons.

The problem is that the re-indexing is painfully slow - about 8 hours.

The server is fairly robust (dual processor, 4 GB of RAM), and everything runs quickly apart from this re-indexing.

Any tips on how to speed this up?

UPDATE

Our client has access to the SQL Server box. It turns out they turned on change tracking on the table that is part of the full text index. We turned this off, and the full population took less than 3 hours. Still not great, but better than 8.

UPDATE 2

The FTI is again taking ~8 hours to populate.

Solution

SQL Server's indexing is slow primarily because of its asynchronous data extraction scheme.

Use change tracking with the "update index in background" option.

The easiest way to improve the performance of full-text indexing is to use change tracking with the "update index in background" option. When you index a table (FTI, like "standard" SQL indexes, works on a per-table basis), you specify full population, incremental population, or change tracking. When you opt for full population, every row in the table you're full-text indexing is extracted and indexed. This is a two-step process.

First, you (or Enterprise Manager) run this system stored procedure:

sp_fulltext_getdata CatalogID, object_id

After the result sets containing all of the timestamps and PK values are returned to MSSearch, MSSearch will issue another sp_fulltext_getdata, but this time once for every row in your table. So if you have 50 million rows in your database, this procedure will be issued 50 million times.

On the other hand, if you use an incremental population, MSSearch will issue an initial:

sp_fulltext_getdata CatalogID, object_id

for each row in the table that you're full-text indexing. So if you have 50 million rows in your database, this statement will also be issued 50 million times. Why? Because even with an incremental population, MSSearch must figure out exactly which rows have been changed, updated, and deleted. Another problem with incremental populations is that they'll index or re-index a row even if the change was made to a column that you aren't full-text indexing.

Although an incremental population is generally faster than a full population, you can see that for large tables, either will be time-consuming.
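For reference, the two population types can also be started from Query Analyzer rather than Enterprise Manager. This is only a minimal sketch: the catalog name DocumentsCatalog is hypothetical, and an incremental population assumes the indexed table has a timestamp column (without one, SQL Server falls back to a full population).

```sql
-- Hypothetical catalog name: DocumentsCatalog. Run in the database that owns the catalog.

-- Full population: every row of every table in the catalog is extracted and indexed.
EXEC sp_fulltext_catalog 'DocumentsCatalog', 'start_full'
GO

-- Incremental population: run this instead of the full population above.
-- Relies on a timestamp column on the indexed table.
-- EXEC sp_fulltext_catalog 'DocumentsCatalog', 'start_incremental'
```

Only one population can run against a catalog at a time, which is why the second call is left commented out.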

I recommend you enable change tracking with background or scheduled updating. If you do, you'll see that MSSearch will first issue another:

sp_fulltext_getdata CatalogID, object_id

for every row in the table with change tracking enabled. Then, for every row that has a column that you're full-text indexing and that's modified after your initial full population, the row information will be written (in the database you're indexing) to the sysfulltextnotify table. MSSearch will then issue sp_fulltext_getdata only for the rows that appear in this table and will then remove them from the sysfulltextnotify table.
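Turning this on from T-SQL might look like the sketch below. The table name dbo.Documents is a placeholder rather than anything from the question, so substitute your own and run it in the database that holds the full-text indexed table.

```sql
-- Hypothetical table name: dbo.Documents.

-- Start tracking changes to the full-text indexed columns.
-- This also starts a population (incremental if the table has a
-- timestamp column, otherwise full).
EXEC sp_fulltext_table 'dbo.Documents', 'start_change_tracking'
GO

-- Push tracked changes into the full-text index as they occur.
EXEC sp_fulltext_table 'dbo.Documents', 'start_background_updateindex'
GO

-- Confirm both options are on (1 = enabled, 0 = disabled).
SELECT
    OBJECTPROPERTY(OBJECT_ID('dbo.Documents'), 'TableFulltextChangeTrackingOn') AS ChangeTrackingOn,
    OBJECTPROPERTY(OBJECT_ID('dbo.Documents'), 'TableFulltextBackgroundUpdateIndexOn') AS BackgroundUpdateOn
```

If you prefer scheduled updating over background updating, skip the second call and run sp_fulltext_table with the 'update_index' action from a scheduled job instead.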

Consider using a separate build server

Tables that are heavily updated while you're indexing can create locking problems, so if you can live with a catalog that's periodically out of date and an MSSearch engine that's sometimes unavailable, consider using a separate build server. You do this by making sure the indexing server has a copy of the table to be full-text indexed and exporting the catalog. Clearly, if you need real-time or near real-time updates to your catalog, this is not a good solution.

Limit activity when population is running

When population is running, don't run Profiler, and limit other database activity as much as possible. Profiler consumes significant resources.
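To know when it is safe to resume other work, you can poll the catalog's population status. The sketch below assumes a hypothetical catalog named DocumentsCatalog and a one-minute polling interval; both are illustrative choices.

```sql
-- Hypothetical catalog name: DocumentsCatalog.
-- PopulateStatus: 0 = idle, 1 = full population in progress,
-- 6 = incremental population in progress (see Books Online for other values).
SELECT
    FULLTEXTCATALOGPROPERTY('DocumentsCatalog', 'PopulateStatus') AS PopulateStatus,
    FULLTEXTCATALOGPROPERTY('DocumentsCatalog', 'ItemCount') AS ItemCount
GO

-- Wait until the catalog reports idle before starting other heavy jobs.
WHILE FULLTEXTCATALOGPROPERTY('DocumentsCatalog', 'PopulateStatus') <> 0
BEGIN
    WAITFOR DELAY '00:01:00'
END
```

Note that with change tracking enabled the catalog may not report idle, so treat the wait loop as a starting point for a nightly full population rather than a general-purpose check.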

Increase the number of threads for the indexing process

Increase the number of threads you're running for the indexing process. The default is only five, and on quads or 8-ways, you can bump this up to much higher values. MSSearch will, however, throttle itself if it's slurping too much data from SQL Server, so avoid doing this on single- or dual-processor systems.

Stop any anti-virus or open file-agent backup software

If this is not possible, try to prevent them from scanning the temporary directories being used by SQL FTI and the catalog directories.

Place the catalog, temp directory and pagefiles on their own controllers

If you can make that investment, place the catalog on its own controller, preferably on a RAID-1 array. Place the temp directory on a RAID-1 array. Similarly, consider putting the pagefile on its own RAID-1 array with its own controller.

Consider creating secondary data files for tempdb, one per CPU or core.
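As a rough sketch, adding one secondary file on a dual-processor server (so tempdb ends up with two data files) could look like this. The logical name, path, and sizes are all made up for illustration, so adjust them to your own disks.

```sql
-- Hypothetical logical name, file path and sizes.
ALTER DATABASE tempdb
ADD FILE
(
    NAME = tempdev2,
    FILENAME = 'T:\MSSQL\Data\tempdb2.ndf',
    SIZE = 512MB,
    FILEGROWTH = 64MB
)
```

Keeping the files the same size helps SQL Server spread allocations evenly across them.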