Search code examples
sqlsql-servert-sqldata-profiling

Validation for columns work very slow (SQL Server)


I want to perform data profiling on the columns of a table. In this particular case - what percentage of data is date/integer/numeric/bit. The query that I am using:

SELECT 
CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentDate,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS FLOAT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentNumeric,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS BIGINT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentInteger,
    CAST(SUM(CASE WHEN LOWER(TRY_CAST([column1] AS VARCHAR(MAX))) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentBit
    FROM tbl

This query works really slow even if I choose only top 1 row. Actually I am not able to get any result, or at least I cannot wait such a long time. The column that I am checking is of type decimal if this is of any importance.

enter image description here

The number of records in the table is: 37,431,866. This is why I choose only top 1000 for example, but still does not load any result for more than 40 minutes


Solution

  • If you want this to run faster, then you don't want to limit the rows in the query you are using. After all, an aggregation query with no GROUP BY only returns one row.

    Instead use a subquery:

    SELECT . . .
    FROM (SELECT TOP (1000) t.*
          FROM tbl t
         ) t
    

    Note that this is not a random sample. And if you attempt ORDER BY newid() you will kill performance. One alternative to get an approximate n% sample is to use logic like this:

    SELECT . . .
    FROM (SELECT TOP (1000) t.*
          FROM tbl t
          WHERE RAND(CHECKSUM(NEWID())) < 0.001
         ) t
    

    The 0.001 would be about a 0.1% sample.