Search code examples
sql-server-2005stored-proceduresdata-miningdata-scrubbing

SQL Server query to return percentage of null content in a table's fields


I'm looking to scrub data in migration project from a legacy system developed on SQL Server 2005, but the first order of business is to figure out which columns aren't really in use.

The general logic behind my approach is to identify columns that are mostly left blank (i.e., most or all rows contain null values, for that column, in that table). This would be executed as a stored procedure, where the ideal output would be something like:

TABLE: contacts (10000 records)
---------------------------------
FIELD: id | 0 (0%) Null Records
FIELD: username | 0 (0%) Null Records
FIELD: phonenumber | 8,200 (82%) Null Records
FIELD: email | 300 (3%) Null records
FIELD: icq | 9,900 (99%) Null Records
FIELD: zip | 100 (1%) Null Records

Here's the catch: some tables have over 100 columns, so the real key is for the procedure to loop through the columns of a given table, so I don't have to key in a long string of column names to run the query. Any help on how to do this would be great,

Thanks.


Solution

  • You can use the column metadata to create some queries like this:

    select 'select ''' + table_name + ''' table_name, ''' 
    + column_name + ''' column_name, count(*) nullcount '
    + 'from [' + table_name + '] where [' + column_name + '] is null '
    from information_schema.columns
    

    If you run the query above, you will get a list of select queries. Copy paste to a text editor and insert 'union all' between the selects, it will look like this:

    select 'companies' table_name, 'id' column_name, count(*) nullcount from [companies] where [id] is null 
    union all
    select 'companies' table_name, 'code' column_name, count(*) nullcount from [companies] where [code] is null 
    union all
    ...
    

    Then run those unioned selects.

    For list of tables, you can use the same trick with metadata in information_schema.tables.

    Then combine the two lists in excel using vlookup, or build a more complex queries with subqueries using information_schema.tables and information_schema.columns.