sqlsql-serversql-server-2019

SQL count the number of comma-separated string match in another string


I have a table in SQL Server with two columns as shown here, all number are comma-separated:

enter image description here

IF OBJECT_ID('tempdb..#TempTable') IS NOT NULL
     DROP TABLE #TempTable

CREATE TABLE #TempTable 
(
     [Sample] varchar(33), 
     Benchmark varchar(33)
)

INSERT INTO #TempTable VALUES ('9,55,66', '55,44');
INSERT INTO #TempTable VALUES ('88,23,2,3,4', '23,88');

I need to match the data in Sample column with the Benchmark, then calculate the missing number ratio:

enter image description here

Let me explain:

  • Column C means the number 55 exist in Sample and Benchmark
  • Column D means only 1 match, which is 55
  • Column E means total number exists in Benchmark = 2
  • Column F is calculated (2-1)/2 = 50% of them are missing

I don't need column C, D ,E to be stored in the table, just a explanation purpose. WHAT I need is the column F.

I am using SQL Server 2019+.


Solution

  • First things first, you're going to need a way to uniquely identify each of your rows. I could make the assumption that sample uniquely identifies the rows, but I'm sort of guessing there might be something like an identity on the table. if there isn't one, make one using row_number(). I simply added an identity called Rid to your temp table.

    The reason you'll need this is you're going to split the samples and the benchmarks into two CTEs, call string_split on each of them, and then join them back together on the split value and the row identifier (in my example, RID, but again, ifsample is a valid key, that works too.

    The samples CTE is going to look like this:

    select RID, Value = trim(b.value)
    from #tmp a
    cross apply string_split(a.sample, ',') b
    

    and the benchmarks CTE is going to look like this:

    select RID, Value = trim(b.value)
    from #tmp a
    cross apply string_split(a.benchmark, ',') b
    

    I'm showing you these separately so you can see what each CTE contains. I'm also trimming the value in case there is whitespace in there. You could go a step further and cast them to int if you want/need, but I'll leave that as an exercise for the reader.

    Finally, left join everything back together on RID and Value, grouping by RID (i.e. the original row you were testing). The Benchmark count is just the total count of rows (or you could use bm.value; it's the same thing since benchmarks is your left table here) and the match count is the count of s.value. This works, because if there isn't a match, you'll get a null, and count(s.value) will skip any s.value where it's null.

    ;with samples as
    (
        select RID, Value = trim(b.value)
        from #tmp a
        cross apply string_split(a.sample, ',') b
    ), benchmarks as
    (
        select RID, Value = trim(b.value)
        from #tmp a
        cross apply string_split(a.benchmark, ',') b
    )
    select
        bm.RID,
        BenchmarkCount = count(1),
        MatchCount = count(s.value),
        MissingRatio = 100 - convert(decimal(9,3), (100.0 * count(s.Value)) / nullif(count(1), 0))
    from benchmarks bm
    left outer join samples s
        on bm.RID = s.RID
            and bm.Value = s.Value
    group by bm.RID
    

    From there, you can format your results however you like.