Search code examples
sql-serversql-server-2008-r2query-performancetable-statistics

SQL Server optimal query based on value distribution


I have a table that stores dynamic user data in a key-value pair format. Something like this:

UserId | Key       | Value
---------------------------------
1      | gender    | male
1      | country   | Australia
2      | gender    | male
2      | country   | US
3      | gender    | female
3      | country   | Spain

Now, I need to select the users that have certain parameters, for example: gender is 'male' AND country is 'US'. Or more general:

key1=value1 AND key2=value2 AND key3=value3 AND ...

To do so, the fastest way I found is to do the following:

WHERE key=(key1) 
AND   value=value1
AND   EXISTS(SELECT 1
             FROM (...)
             WHERE key=key2
             AND   value=value2)
AND   EXISTS(SELECT 1
             FROM (...)
             WHERE key=key3
             AND   value=value3)
AND   EXISTS(...)

In that case, I'll get the best result if the first WHERE filter is for the one whose values are more evenly and segregated.

For example, 'gender' can have 99% males and 1% females and country can partitionate the entire population in 100 similar parts. In that case I would need to filter by country first and use EXIST for the gender condition.

Question: Is there any way in SQL Server 2008 R2 to get the index statistics to find what clause is better to put first (not in the EXISTS basically) ?.

Alternative question: I think that that's the best approach, but a way to rewrite that query to be always optimal can be the solution too.

SOLUTION INFORMATION:

The correct solution is the one explained by @usr below (using INTERSECT). Actually it seems I was doing something wrong, the EXISTS was being resolved correctly too by the engine. To provide more info, I am sharing the IO and TIME statistics as well as the execution plan for the options tested:

Using INTERSECT:

Table 'PERFTEST'. Scan count 2, logical reads 113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 2 ms.

Query plan using INTERSECT

Using EXISTS:

Table 'PERFTEST'. Scan count 2, logical reads 113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 3 ms.

Query plan using EXISTS

(Note the extra Stream Aggregate step)

Using INNER JOIN:

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'PERFTEST'. Scan count 2, logical reads 113, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 31 ms,  elapsed time = 25 ms.

Query plan using INNER JOIN

CONCLUSION:

INTERSECT is slightly faster in this case that EXISTS. The INNER JOIN option is way slower.


Solution

  • "what clause is better to put first"

    The optimizer does exactly that for you. Queries are not evaluated as written. EXISTS is translated to a join and undergoes the usual join reordering optimizations. Statistics are being used to drive that process. It's not perfect but usually good.

    Use OPTION (RECOMPILE) to get a plan that is adjusted for the specific search parameters you are using.


    select UserID from T where Condition1
    intersect select UserID from T where Condition2
    intersect select UserID from T where Condition3