Search code examples
amazon-redshiftquery-optimization

Speeding up postgres / Redshift queries


I have been given a requirement to query 8(+) different Redshift tables from a Lambda function and return some of the fields where email = provided email address.

Current implementation is a JDBC connection using Spring Data JDBC :

@Query("select * from
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyA) companyA 
 natural full join
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyB) companyB 
 natural full join
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyC) companyC 
 natural full join
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyD) companyD 
 natural full join
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyE) companyE 
 natural full join
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyF) companyF 
 natural full join
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyG) companyG 
 natural full join
 (select product_name, email, firstname, lastname, accountid, phone, 
   from companyH) companyH 
 where email = :emailId;")
public List<Contact> getContactSortByDate(String emailId); 

The problem with this approach is that each of these tables are quite large and so the query can take 45+ seconds. To make it worse, I will eventually need to implement a one to many join on "accountid" to 8 account tables to grab some fields there.

And... my boss wants all of this to happen in less than 5 seconds... I don't think that will be possible, but I wanted to know if there were things I can do to speed this search up as much as possible?

Questions:

  • is there a way to make this query faster?
  • If I am able to switch to using the Redshift client from inside the VPC instead of a JDBC connection, is there still any hope of getting below (or anywhere near) 5 seconds?

EDIT: As requested, here is the Updated results of '''EXPLAIN```:

XN Subquery Scan alltables  (cost=35.45..142252362.72 rows=35 width=520)
  ->  XN Append  (cost=35.45..142252362.37 rows=35 width=1440)
        ->  XN Subquery Scan "*SELECT* 1"  (cost=35.45..67200090.91 rows=11 width=1440)
              ->  XN Hash Join DS_BCAST_INNER  (cost=35.45..67200090.80 rows=11 width=1440)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyA_accounts acc  (cost=0.00..4.91 rows=491 width=816)
                    ->  XN Hash  (cost=35.43..35.43 rows=10 width=663)
                          ->  XN Seq Scan on companyA usr  (cost=0.00..35.43 rows=10 width=663)
                                Filter: ((email)::text = '[email protected]'::text)
        ->  XN Subquery Scan "*SELECT* 2"  (cost=41.85..41600165.37 rows=11 width=943)
              ->  XN Hash Join DS_BCAST_INNER  (cost=41.85..41600165.26 rows=11 width=943)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyB_accounts acc  (cost=0.00..10.96 rows=1096 width=550)
                    ->  XN Hash  (cost=41.83..41.83 rows=10 width=406)
                          ->  XN Seq Scan on companyB usr  (cost=0.00..41.83 rows=10 width=406)
                                Filter: ((email)::text = '[email protected]'::text)
        ->  XN Subquery Scan "*SELECT* 3"  (cost=512.54..6241501.45 rows=2 width=1374)
              ->  XN Hash Join DS_BCAST_INNER  (cost=512.54..6241501.43 rows=2 width=1374)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyC_accounts acc  (cost=0.00..439.50 rows=43950 width=771)
                    ->  XN Hash  (cost=512.54..512.54 rows=1 width=614)
                          ->  XN Seq Scan on companyC usr  (cost=0.00..512.54 rows=1 width=614)
                                Filter: ((email)::text = '[email protected]'::text)
        ->  XN Subquery Scan "*SELECT* 4"  (cost=2178.78..4403808.58 rows=3 width=985)
              ->  XN Hash Join DS_BCAST_INNER  (cost=2178.78..4403808.55 rows=3 width=985)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyD_accounts acc  (cost=0.00..501.46 rows=50146 width=809)
                    ->  XN Hash  (cost=2178.78..2178.78 rows=2 width=212)
                          ->  XN Seq Scan on companyD usr  (cost=0.00..2178.78 rows=2 width=212)
                                Filter: (((email)::text = '[email protected]'::text) AND (accountid IS NOT NULL))
        ->  XN Subquery Scan "*SELECT* 5"  (cost=3534.94..4244248.75 rows=2 width=511)
              ->  XN Hash Join DS_BCAST_INNER  (cost=3534.94..4244248.73 rows=2 width=511)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyE_accounts acc  (cost=0.00..219.62 rows=21962 width=347)
                    ->  XN Hash  (cost=3534.94..3534.94 rows=2 width=203)
                          ->  XN Seq Scan on companyE usr  (cost=0.00..3534.94 rows=2 width=203)
                                Filter: (((email)::text = '[email protected]'::text) AND (accountid IS NOT NULL))
        ->  XN Subquery Scan "*SELECT* 6"  (cost=810.77..1921107.33 rows=1 width=1175)
              ->  XN Hash Join DS_BCAST_INNER  (cost=810.77..1921107.32 rows=1 width=1175)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyF_accounts acc  (cost=0.00..131.80 rows=13180 width=1030)
                    ->  XN Hash  (cost=810.76..810.76 rows=1 width=184)
                          ->  XN Seq Scan on companyF usr  (cost=0.00..810.76 rows=1 width=184)
                                Filter: ((email)::text = '[email protected]'::text)
        ->  XN Subquery Scan "*SELECT* 7"  (cost=705.19..9200982.48 rows=3 width=1204)
              ->  XN Hash Join DS_BCAST_INNER  (cost=705.19..9200982.45 rows=3 width=1204)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyG_accounts acc  (cost=0.00..85.30 rows=8530 width=790)
                    ->  XN Hash  (cost=705.19..705.19 rows=2 width=449)
                          ->  XN Seq Scan on companyG usr  (cost=0.00..705.19 rows=2 width=449)
                                Filter: ((email)::text = '[email protected]'::text)
        ->  XN Subquery Scan "*SELECT* 8"  (cost=420.21..7440457.49 rows=2 width=1135)
              ->  XN Hash Join DS_BCAST_INNER  (cost=420.21..7440457.47 rows=2 width=1135)
                    Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
                    ->  XN Seq Scan on companyH_accounts acc  (cost=0.00..11.46 rows=1146 width=784)
                    ->  XN Hash  (cost=420.20..420.20 rows=2 width=362)
                          ->  XN Seq Scan on companyH usr  (cost=0.00..420.20 rows=2 width=362)
                                Filter: ((email)::text = '[email protected]'::text)

Solution: As O.Jones pointed out, 'UNION ALL' sped it up by quite a bit! I still have a bunch of joins, but brought the time down by a lot. Here is the final solution:

SELECT * 
  FROM (
     SELECT usr.product_name, acc.product_loc, acc.phone, usr.email, usr.firstname, usr.lastname, usr.accountid,  
       FROM companyA usr
         JOIN companyA_accounts acc ON usr.accountid = acc.id
     UNION ALL 
    SELECT usr.product_name, acc.product_loc, acc.phone, usr.email, usr.firstname, usr.lastname, usr.accountid,  
      FROM companyB usr
         JOIN companyB_accounts acc ON usr.accountid = acc.id
     /* UNION ALL SELECT repeated for the rest of the tables */
  ) alltables
 WHERE alltables.email =  :emailId;

I am still open to information on best practices / efficiency if switching to RedshiftClient though!


Solution

  • Have you tried something like this query, using an ordinary UNION ALL rather than all those JOINs?

    SELECT * 
      FROM (
         SELECT product_name, email, firstname, lastname, accountid, phone, 
           FROM companyA
         UNION ALL 
        SELECT product_name, email, firstname, lastname, accountid, phone, 
          FROM companyB
         /* UNION ALL SELECT repeated for the rest of the tables */
      ) alltables
      JOIN some_other_table ON alltables.accountid = some_other_table.accountid
     WHERE alltables.email =  :emailId;
    

    A query like this can exploit any indexes on the email columns of any tables that have them.

    And, as my example shows you can do the joins with fairly clean SQL.