Search code examples
sqldata-analysisazure-databricks

optimize query taking too long to run


I am trying to run a SQL query in azure databricks. But this query is taking too long to execute. Could you please suggest what optimization I can do in this below query to make it more faster

SELECT 
LOWER(get_json_object(element_at(params,'ans.client_payload'), '$.ACTIONBLOCK_ID')) AS content_id,
lower(element_at(params,'event.type')) as event_type,
count(distinct(userguid)) as num_usr,
count(*) as num_events

FROM my_table

where event_date >= '2023-04-03'
AND event_code = 'ABC_SERVICE'
AND lower(element_at(params,'event.subcategory')) =  'message'
AND lower(element_at(params,'event.type')) in ('render', 'click')
AND element_at(params,'consumer.name') in ('ABC', 'BCC')
AND element_at(params,'source.name') =  'NNC'
and regexp_like(element_at(params,'exp.campaign_id'), '\b12345\b')
and userguid is not NULL
GROUP BY 
LOWER(get_json_object(element_at(params,'ans.client_payload'), '$.ACTIONBLOCK_ID')),
lower(element_at(params,'event.type'))

I am trying to execute this query but this is taking too long to execute. looking for suggestions to optimize this query.


Solution

    • try to reduce the amount of data that needs to be scanned (select only the columns that are needed + subqueries to filter data before aggregating it
    SELECT 
      content_id,
      event_type,
      count(distinct(userguid)) as num_usr,
      count(*) as num_events
    FROM (
      SELECT 
        LOWER(get_json_object(element_at(params,'ans.client_payload'), '$.ACTIONBLOCK_ID')) AS content_id,
        lower(element_at(params,'event.type')) as event_type,
        userguid
      FROM my_table
      WHERE 
        event_date >= '2023-04-03'
        AND event_code = 'ABC_SERVICE'
        AND lower(element_at(params,'event.subcategory')) =  'message'
        AND lower(element_at(params,'event.type')) in ('render', 'click')
        AND element_at(params,'consumer.name') in ('ABC', 'BCC')
        AND element_at(params,'source.name') =  'NNC'
        and regexp_like(element_at(params,'exp.campaign_id'), '\b12345\b')
        and userguid is not NULL
    )
    GROUP BY 
      content_id,
      event_type
    
    • make sure that table has appropriate indexes defined

    you may want to create an index on event_date event_code element_at(params,'event.subcategory') element_at(params,'event.type') element_at(params,'consumer.name') element_at(params,'source.name') element_at(params,'exp.campaign_id') userguid

    ps. be mindful of the size of the index(you don't want it to be too big)