Search code examples
sqlunionprestodata-transform

Optimize the run time of Presto SQL Data Transformation


I have a website data, where I track the number of pages that are viewed and a video content that's completely viewed. I have no issues with pulling the data. However, I need to put this into a dashboard and trying to transform the data so that there's a column that would track the content name, other columns to track Viewed and Completed.

Like this:

enter image description here

The Original query to pull the data takes around 3 minutes. But when I updated it to get the transformation:

select distinct site_name, date, page_name, 'View' as content-type, content1_View as content_name, launch_count
from report_qry

UNION

select distinct site_name, date, page_name, 'Complete' as content-type, content1_complete as content_name, launch_count
from report_qry

After adding the unions, the query is now taking 15 Minutes to complete. Is there any workaround or a better way to do this?

The original query gets data from around 10 million records also there's a good chance that the number of contents will increase. Now I just have 2.


Solution

  • If I read your description correctly, the report_qry is a CTE and it took 3 mins to complete.

    Your query can be optimized by the following steps: Step 1. UNION -> UNION ALL

    UNION means UNION DISTINCT, your subqueries guarante content-types are different between them ('View' VS 'Complete'). So we don't need UNION DISTINCT here. Write it explicitly as UNION ALL will save the distinct step when having union operation on subqueries.

    select distinct site_name, date, page_name, 'View' as content-type, content1_View as content_name, launch_count
    from report_qry
    
    UNION ALL
    
    select distinct site_name, date, page_name, 'Complete' as content-type, content1_complete as content_name, launch_count
    from report_qry
    

    Step 2. Avoid computing report_qry twice by using UNNEST()

    report_qry is a CTE, it appears twice in the query and are computed twice. we can rewrite the query by using UNNEST() to avoid that

    select distinct 
        site_name, 
        date, 
        page_name, 
        content-type, 
        case when content-type = 'View' then content1_View
             when content-type = 'Complete' then content1_complete 
        end as content_name, 
        launch_count
    from report_qry 
    cross join unnest(ARRAY['View','Complete']) as t(content-type)
    

    Step 3 (Optional). Verify the data to see whether select distinct is necessary. If it is not necessary, remove distinct operator.