oracle-database fuzzy-search fuzzy-logic fuzzy

Need to roughly group similar query executions in Oracle that have slightly different constraints

Working on assessing impact of some current databases planned retirement. It's not feasible for individual communication with users that have accessed impacted data recently due to volume.

I'm thinking that if I can do some form of fuzzy logic lookup to group like queries by user, at a minimum I can identify reoccuring queries that differ slightly because of expected constraint changes. Although far from perfect, this could help represent queries that are ran regularly to support reoccurring business functions vs pure ad-hoc.

Can anyone provide some ideas that can get me started or let me know if there are any alternate ideas to research given my goals stated above?

Solution

You can combine UTL_MATCH, DBA_HIST_SQLTEXT, and DBA_HIST_SQLSTAT to find similar query executions. If you haven't licensed AWR, or are only interested in recent queries, you can use GV$SQLSTATS instead of the DBA_HIST tables.

In addition to being complicated, you will need to adjust some of the literals in the below query based on trial and error. Currently, it only looks at the top 10 most executed query per user, and only finds the top 5 most related queries where the similarity score is greater than or equal to 60%.

--Common queries and the top 5 most-closely related queries.
with statements as
(
    --All relevant SQL statements
    select
        sqlstats.parsing_schema_name,
        sqlstats.total_executions,
        sqltext.sql_id,
        --Convert CLOB to VARCHAR for UTL_MATCH.
        --Won't matter, since we're only interseted in fuzzy matches anyway.
        to_char(substr(sqltext.sql_text, 1, 1000)) sql_text,
        sqltext.command_type
    from
    (
        --All queries in AWR.
        select sql_id, sql_text, command_type
        from dba_hist_sqltext
    ) sqltext
    join
    (
        --Statistics for all queries in AWR.
        select sql_id, parsing_schema_name, sum(executions_delta) total_executions
        from dba_hist_sqlstat
        group by sql_id, parsing_schema_name
    ) sqlstats
        on sqltext.sql_id = sqlstats.sql_id
    order by parsing_schema_name, total_executions desc
)
--Top N most similar queries.
select *
from
(
    --Ranked similarity.
    select
        similarity.*,
        row_number() over (partition by sql_id1 order by similarity desc) top_similarity
    from
    (
        --Similarity between SQL statements for the Top N SQL and other SQL run by the same user.
        select
            top_n.parsing_schema_name, top_n.sql_id sql_id1, top_n.sql_text sql_text1, top_n.total_executions,
            statements.sql_id sql_id2, statements.sql_text sql_text2,
            utl_match.edit_distance_similarity(top_n.sql_text, statements.sql_text) similarity
        from
        (
            --Top N most executed queries.
            select *
            from
            (
                --Most executed queries per user.
                select
                    statements.*,
                    row_number () over (partition by parsing_schema_name order by total_executions desc) top_n 
                from statements
                order by parsing_schema_name, total_executions desc
            )
            where top_n <= 10
        ) top_n
        join statements
            on top_n.parsing_schema_name = statements.parsing_schema_name
            and top_n.command_type = statements.command_type
            and top_n.sql_id <> statements.sql_id
        order by top_n.sql_id, similarity desc, statements.sql_id
    ) similarity
) ranked_similarity
where top_similarity <= 5
    and similarity >= 60
order by parsing_schema_name, sql_id1, top_similarity;