Search code examples
sql-serverstringt-sqlcharindex

How can I use T-SQL to find all positions that are not in a list of characters?


I need to find all the positions in a string that are not A, G, C, or T (this is genomic data). I have figured out a way to do this using a loop (see below) but honestly I am not sure if there is a 'smarter' way of doing this.

I am using SQL Server.

DECLARE @myTest varchar(max) = 'GGCGATXAATXCCC-GCCT'

DECLARE @pos int =1 

DECLARE @table1 TABLE (position int, DiffValue varchar(1))

WHILE (@pos <= LEN(@myTest)) 
BEGIN 
    INSERT INTO @table1
        SELECT 
            @pos,
            CASE
               WHEN SUBSTRING(@myTest, @pos, 1) NOT IN ('A','G','C','T') 
                  THEN SUBSTRING(@myTest, @pos, 1)                                              
            END
        WHERE 
            SUBSTRING(@myTest, @pos, 1) NOT IN ('A','G','C','T')
    SELECT @pos= @pos + 1
END

SELECT * FROM @table1

Results in

position    DiffValue
   7            X
   11           X
   15           -

Solution

  • Just grab a copy of NGrams8K and your all set.

    DECLARE @myTest varchar(max) = 'GGCGATXAATXCCC-GCCT'
    
    SELECT Position, Token 
    FROM dbo.ngrams8K(@myTest, 1)
    WHERE token NOT LIKE '[AGCT]';
    

    Results:

    Position             Token
    -------------------- --------
    7                    X
    11                   X
    15                   -