Search code examples
sqlsnowflake-cloud-data-platform

How to get a list of tables used for each query in the query history in snowflake


I am doing analysis on table usage. I need to use the query history table and take out all tables that were used in a query. The query itself could use 10 separate tables, or even use the same table multiple times. I need to be able to parse through the query (using a query itself, not python or any other language, just SQL) and take out a list of all tables that the query hit.

Is there any way to do this?

An example of where I am getting the histories would be this query:

select query_text
from table(information_schema.query_history())

Solution

  • An alternative approach using rlike and information_schema.tables.

    You could extend this further by looking at the # rows per table (high = fact, low = dimension) and the number of times accessed.

    select query_text, array_agg(DISTINCT TABLE_NAME::string) 
    from 
     (select top 100 query_text 
      from 
         table(information_schema.query_history())
      where 
         EXECUTION_STATUS = 'SUCCESS' ) a
     left outer join 
     (select TABLE_NAME from INFORMATION_SCHEMA.TABLES group by TABLE_NAME) b
     on 
        upper(a.query_text) rlike '.*('||upper(b.table_name)||').*'
    group by 
       query_text
    

    Extended Version:

    I noticed there's some issues with the above answer. Firstly that it does not allow you to run the explain plan any more than one query at a time. Secondly if the query_id uses a cache it fails to return any objects.

    So be extending my initial answer as follows.

    1. Create a couple views that read all the databases and provide central authority on all tables/views/objects/query_histories.
    2. Run the generated SQL which creates a couple views. It again uses rlike but substitutes database and schema names from the query_history when not present.

    I've added credits, time elapsed to the two views for more extensions.

    You can validate for yourself by checking the explain plan as above and if you don't see identical tables check the SQL and you'll most likely see the cache has been used.

    Would be great to hear if anyone finds this useful.

    enter image description here

    Step 1 Create a couple views :

    show databases;
    select RTRIM( 'create or replace view bob as ( '||listagg('select  CONCAT_WS(\'.\',TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME) table_name_3,CONCAT_WS(\'.\',TABLE_SCHEMA, TABLE_NAME) table_name_2,TABLE_NAME, ROW_COUNT, BYTES from ' ||"name" ||'.INFORMATION_SCHEMA.TABLES union all ') ,' union all')||')' tabs,
    RTRIM( 'create or replace view bobby as ( '||listagg('select QUERY_ID, query_text ,DATABASE_NAME, SCHEMA_NAME, CREDITS_USED_CLOUD_SERVICES , TOTAL_ELAPSED_TIME from table( '||"name" ||'.information_schema.query_history())  where  EXECUTION_STATUS = \'SUCCESS\' union all ') ,' union all')||')' tabs2 
    from table(result_scan( LAST_QUERY_ID()));
    

    Step 2 Run this SQL:

    select
    QUERY_TEXT,
    query_id,
    CREDITS_USED,
    TOTAL_ELAPSED,
    array_agg(TABLE_NAME_3) tables_used
    from
    (
        select
            QUERY_TEXT
            ,query_id
            ,TABLE_NAME
            , rlike(  (a.query_text) ,  '.*(\\s.|\\.){1}('||(bob.TABLE_NAME)||'(\\s.*|$))','is')   aa
            , rlike(  (a.query_text) ,  '.*(\\s.|\\.){1}('||(bob.TABLE_NAME_2)||'(\\s.*|$))','is')  bb
            , rlike(  (a.query_text) ,  '.*(\\s.){1}('||(bob.TABLE_NAME_3)||'(\\s.*|$))','is')     cc,
            bob.TABLE_NAME_3,
            count(1) cnt,
            max(CREDITS_USED_CLOUD_SERVICES) CREDITS_USED,
            max(TOTAL_ELAPSED_TIME) TOTAL_ELAPSED
        from
           BOBBY a
       left outer join 
           BOB 
         on 
            rlike(  (a.query_text) ,  '.*(\\s.|\\.){1}('||(bob.TABLE_NAME)||'(\\s.*|$))','is')    
         or rlike(  (a.query_text) ,  '.*(\\s.|\\.){1}('||(bob.TABLE_NAME_2)||'(\\s.*|$))','is')    
         or rlike(  (a.query_text) ,  '.*(\\s.|\\.){1}('||(bob.TABLE_NAME_3)||'(\\s.*|$))','is')    
        where 
            TABLE_NAME is not null 
            and (  cc
                 or    iff(bb, upper( DATABASE_NAME||'.'||TABLE_NAME) = bob.TABLE_NAME_3, false)
                 or    iff(aa, upper (DATABASE_NAME||'.'||SCHEMA_NAME||'.'||TABLE_NAME) = bob.TABLE_NAME_3, false)   
                )
        group by 
           1,2,3,4,5,6,7) 
     group by 
        1,2,3,4;