Search code examples
sqlsql-servert-sqlsql-server-2000

Best SQL query for list of records containing certain characters?


I'm working with a relatively large SQL Server 2000 DB at the moment. It's 80 GB in size, and have millions and millions of records.

I currently need to return a list of names that contains at least one of a series of illegal characters. By illegal characters is just meant an arbitrary list of characters that is defined by the customer. In the below example I use question mark, semi-colon, period and comma as the illegal character list.

I was initially thinking to do a CLR function that worked with regular expressions, but as it's SQL server 2000, I guess that's out of the question.

At the moment I've done like this:

select x from users
where 
columnToBeSearched like '%?%' OR
columnToBeSearched like '%;%' OR
columnToBeSearched like '%.%' OR
columnToBeSearched like '%,%' OR
otherColumnToBeSearched like '%?%' OR
otherColumnToBeSearched like '%;%' OR
otherColumnToBeSearched like '%.%' OR
otherColumnToBeSearched like '%,%'

Now, I'm not a SQL expert by any means, but I get the feeling that the above query will be very inefficient. Doing 8 multiple wildcard searches in a table with millions of records, seems like it could slow the system down rather seriously. While it seems to work fine on test servers, I am getting the "this has to be completely wrong" vibe.

As I need to execute this script on a live production server eventually, I hope to achieve good performance, so as not to clog the system. The script might need to be expanded later on to include more illegal characters, but this is very unlikely.

To sum up: My aim is to get a list of records where either of two columns contain a customer-defined "illegal character". The database is live and massive, so I want a somewhat efficient approach, as I believe the above queries will be very slow.

Can anyone tell me the best way for achieving my result? Thanks!

/Morten


Solution

  • It doesn't get used much, but the LIKE statement accepts patterns in a similar (but much simplified) way to Regex. This link is the msdn page for it.

    In your case you could simplify to (untested):

    select x from users
    where 
        columnToBeSearched like '%[?;.,]%' OR
        otherColumnToBeSearched like '%[?;.,]%'
    

    Also note that you can create the LIKE pattern as a variable, allowing for the customer defined part of your requirements.

    One other major optimization: If you've got an updated date (or timestamp) on the user row (for any audit history type of thing), then you can always just query rows updated since the last time you checked.