Search code examples
sqlpostgresqlquery-optimizationgreenplum

What are the possible ways to optimize the below postgreSQL code?


I have written this SQL query to fetch the data from greenplum datalake. The primary table has hardy 800,000ish rows which I am joining with other table. The below query is taking insane amount of time to give result. What might be the possible reason for the longer query time? How to resolve it?

select    
          a.pole,
          t.country_name,
          a.service_area,  
          a.park_name,

          t.turbine_platform_name,
          a.turbine_subtype,
          a.pad as "turbine_name",
          t.system_number as "turbine_id",
          a.customer,
          a.service_contract,   

          a.component,
          c.vendor_mfg as "component_manufacturer",

          a.case_number,
          a.description as "case_description",
          a.rmd_diagnosis as "case_rmd_diagnostic_description",
          a.priority as "case_priority",
          a.status as "case_status",
          a.actual_rootcause as "case_actual_rootcause",
          a.site_trends_feedback as "case_site_feedback",
          a.added as "date_case_added",
          a.start as "date_case_started",
          a.last_flagged as "date_case_flagged_by_algorithm_latest",
          a.communicated as "date_case_communicated_to_field",
          a.field_visible_date as "date_case_field_visbile_date",
          a.fixed as "date_anamoly_fixed",
          a.expected_clse as "date_expected_closure",
          a.request_closure_date as "date_case_request_closure",
          a.validation_date as "date_case_closure",
          a.production_related,
          a.estimated_value as "estimated_cost_avoidance",
          a.cms,
          a.anomaly_category,
          a.additional_information as "case_additional_information",
          a.model,
          a.full_model,
          a.sent_to_field as "case_sent_to_field"

      from app_pul.anomaly_stage a
 left join ge_cfg.turbine_detail t on a.scada_number = t.system_number and a.added > '2017-12-31'
 left join tbwgr_v.pmt_wmf_tur_component_master_t c on a.component = c.component_name

Solution

  • Your query is basically:

     select . . .
     from app_pul.anomaly_stage a left join
          ge_cfg.turbine_detail t 
          on a.scada_number = t.system_number and
             a.added > '2017-12-31' left join
             tbwgr_v.pmt_wmf_tur_component_master_t c
             on a.component = c.component_name
    

    First, the condition on a is ignored, because it is the first table in the left join and is the on clause. So, I assume you actually intend for it to filter, so write the query as:

     select . . .
     from app_pul.anomaly_stage a left join
          ge_cfg.turbine_detail t 
          on a.scada_number = t.system_number left join
          tbwgr_v.pmt_wmf_tur_component_master_t c
          on a.component = c.component_name
     where a.added > '2017-12-31'
    

    That might help with performance. Then in Postgres, you would want indexes on turbine_detail(system_number) and pmt_wmf_tur_component_master_t(component_name). It is doubtful that an index would help on the first table, because you are already selecting a large amount of data.

    I'm not sure if indexes would be appropriate in Greenplum.