I need to find all the positions in a string that are not A, G, C, or T (this is genomic data). I have figured out a way to do this using a loop (see below) but honestly I am not sure if there is a 'smarter' way of doing this.
I am using SQL Server.
DECLARE @myTest varchar(max) = 'GGCGATXAATXCCC-GCCT'
DECLARE @pos int =1
DECLARE @table1 TABLE (position int, DiffValue varchar(1))
WHILE (@pos <= LEN(@myTest))
BEGIN
INSERT INTO @table1
SELECT
@pos,
CASE
WHEN SUBSTRING(@myTest, @pos, 1) NOT IN ('A','G','C','T')
THEN SUBSTRING(@myTest, @pos, 1)
END
WHERE
SUBSTRING(@myTest, @pos, 1) NOT IN ('A','G','C','T')
SELECT @pos= @pos + 1
END
SELECT * FROM @table1
Results in
position DiffValue
7 X
11 X
15 -
Just grab a copy of NGrams8K and your all set.
DECLARE @myTest varchar(max) = 'GGCGATXAATXCCC-GCCT'
SELECT Position, Token
FROM dbo.ngrams8K(@myTest, 1)
WHERE token NOT LIKE '[AGCT]';
Results:
Position Token
-------------------- --------
7 X
11 X
15 -