Search code examples
sqlsql-serverpattern-matchingsql-like

SQL query - LEFT 1 = char, RIGHT 3-5 = numbers in Name


I need to filter out junk data in SQL (SQL Server 2008) table. I need to identify these records, and pull them out.

  • Char[0] = A..Z, a..z
  • Char[1] = 0..9
  • Char[2] = 0..9
  • Char[3] = 0..9
  • Char[4] = 0..9

{No blanks allowed}

Basically, a clean record will look like this:

  • T1234, U2468, K123, P50054 (4 record examples)

Junk data looks like this:

  • T12.., .T12, MARK, TP1, SP2, BFGL, BFPL (7 record examples)

Can someone please assist with a SQL query to do a LEFT and RIGHT method and extract those characters, and do a LIKE IN or something?

A function would be great though!


Solution

  • The following should work in a few different systems:

    SELECT * 
    FROM TheTable
    WHERE Data LIKE '[A-Za-z][0-9][0-9][0-9][0-9]%'
    AND Data NOT LIKE '% %'
    

    This approach will indeed match P2343, P23423JUNK, and other similar text but requires that the format is A0000*.

    Now, if the OP implies a format of 1st position is a character and all succeeding positions are numeric, as in A0+, then use the following (in SQL Server and a good deal of other database systems):

    SELECT *
    FROM TheTable
    WHERE SUBSTRING(Data, 1, 1) LIKE '[A-Za-z]'
    AND SUBSTRING(Data, 2, LEN(Data) - 1) NOT LIKE '%[^0-9]%'
    AND LEN(Data) >= 5
    

    To incorporate this into a SQL Server 2008 function, since this appears to be what you'd like most, you can write:

    CREATE FUNCTION ufn_IsProperFormat(@data VARCHAR(50))
    RETURNS BIT
    AS
    BEGIN
        RETURN 
         CASE 
          WHEN SUBSTRING(@Data, 1, 1) LIKE '[A-Za-z]'
            AND SUBSTRING(@Data, 2, LEN(@Data) - 1) NOT LIKE '%[^0-9]%'
            AND LEN(@Data) >= 5 THEN 1 
           ELSE 0 
          END
    END
    

    ...and call into it like so:

    SELECT * 
    FROM TheTable
    WHERE dbo.ufn_IsProperFormat(Data) = 1
    

    ...this query needs to change for Oracle queries because Oracle doesn't appear to support bracket notation in LIKE clauses:

    SELECT *
    FROM TheTable
    WHERE REGEXP_LIKE(Data, '^[A-za-z]\d{4,}$')
    

    This is the expansion gbn is doing in his answer, but these versions allow for varying string lengths without the OR conditions.

    EDIT: Updated to support examples in SQL Server and Oracle for ensuring the format A0+, so that A1324, A2342388, and P2342 match but A2342JUNK and A234 do not.

    The Oracle REGEXP_LIKE code was borrowed from Mark's post but updated to support 4 or more numeric digits.

    Added a custom SQL Server 2008 approach which implements these techniques.