I am doing analysis on table usage. I need to use the query history table and take out all tables that were used in a query. The query itself could use 10 separate tables, or even use the same table multiple times. I need to be able to parse through the query (using a query itself, not python or any other language, just SQL) and take out a list of all tables that the query hit.
Is there any way to do this?
An example of where I am getting the histories would be this query:
select query_text
from table(information_schema.query_history())
An alternative approach using rlike and information_schema.tables.
You could extend this further by looking at the # rows per table (high = fact, low = dimension) and the number of times accessed.
select query_text, array_agg(DISTINCT TABLE_NAME::string)
from
(select top 100 query_text
from
table(information_schema.query_history())
where
EXECUTION_STATUS = 'SUCCESS' ) a
left outer join
(select TABLE_NAME from INFORMATION_SCHEMA.TABLES group by TABLE_NAME) b
on
upper(a.query_text) rlike '.*('||upper(b.table_name)||').*'
group by
query_text
Extended Version:
I noticed there's some issues with the above answer. Firstly that it does not allow you to run the explain plan any more than one query at a time. Secondly if the query_id uses a cache it fails to return any objects.
So be extending my initial answer as follows.
I've added credits, time elapsed to the two views for more extensions.
You can validate for yourself by checking the explain plan as above and if you don't see identical tables check the SQL and you'll most likely see the cache has been used.
Would be great to hear if anyone finds this useful.
Step 1 Create a couple views :
show databases;
select RTRIM( 'create or replace view bob as ( '||listagg('select CONCAT_WS(\'.\',TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME) table_name_3,CONCAT_WS(\'.\',TABLE_SCHEMA, TABLE_NAME) table_name_2,TABLE_NAME, ROW_COUNT, BYTES from ' ||"name" ||'.INFORMATION_SCHEMA.TABLES union all ') ,' union all')||')' tabs,
RTRIM( 'create or replace view bobby as ( '||listagg('select QUERY_ID, query_text ,DATABASE_NAME, SCHEMA_NAME, CREDITS_USED_CLOUD_SERVICES , TOTAL_ELAPSED_TIME from table( '||"name" ||'.information_schema.query_history()) where EXECUTION_STATUS = \'SUCCESS\' union all ') ,' union all')||')' tabs2
from table(result_scan( LAST_QUERY_ID()));
Step 2 Run this SQL:
select
QUERY_TEXT,
query_id,
CREDITS_USED,
TOTAL_ELAPSED,
array_agg(TABLE_NAME_3) tables_used
from
(
select
QUERY_TEXT
,query_id
,TABLE_NAME
, rlike( (a.query_text) , '.*(\\s.|\\.){1}('||(bob.TABLE_NAME)||'(\\s.*|$))','is') aa
, rlike( (a.query_text) , '.*(\\s.|\\.){1}('||(bob.TABLE_NAME_2)||'(\\s.*|$))','is') bb
, rlike( (a.query_text) , '.*(\\s.){1}('||(bob.TABLE_NAME_3)||'(\\s.*|$))','is') cc,
bob.TABLE_NAME_3,
count(1) cnt,
max(CREDITS_USED_CLOUD_SERVICES) CREDITS_USED,
max(TOTAL_ELAPSED_TIME) TOTAL_ELAPSED
from
BOBBY a
left outer join
BOB
on
rlike( (a.query_text) , '.*(\\s.|\\.){1}('||(bob.TABLE_NAME)||'(\\s.*|$))','is')
or rlike( (a.query_text) , '.*(\\s.|\\.){1}('||(bob.TABLE_NAME_2)||'(\\s.*|$))','is')
or rlike( (a.query_text) , '.*(\\s.|\\.){1}('||(bob.TABLE_NAME_3)||'(\\s.*|$))','is')
where
TABLE_NAME is not null
and ( cc
or iff(bb, upper( DATABASE_NAME||'.'||TABLE_NAME) = bob.TABLE_NAME_3, false)
or iff(aa, upper (DATABASE_NAME||'.'||SCHEMA_NAME||'.'||TABLE_NAME) = bob.TABLE_NAME_3, false)
)
group by
1,2,3,4,5,6,7)
group by
1,2,3,4;