Search code examples
sqlpostgresqlperformancesubquery

Speed up SQL subquery


I am working on a system where any number of companies can upload Small and Medium Enterprise information into a database and search it.

The table import_data stores SME information, uploaded by different companies. The information uploaded by each company is identified by the column company_id of the table import_data.

CREATE TABLE import_data 
(
    id uuid NOT NULL,    # Indexed
    company_id integer NOT NULL,    # Indexed
    business_name character varying,    # Indexed
    no_of_rating double precision,    # Indexed
    no_of_reviews double precision,    # Indexed
    phone character varying,    # Indexed
    email character varying,    # Indexed
    address character varying,    # Indexed
    upserted_by integer,  # Not indexed
    upsert_date timestamp without time zone,    # Not indexed
)
PARTITION BY LIST (company_id);

Now I am working on a recommendation system, where a record whose phone number is not present in the data uploaded by a particular company, but is present in data uploaded by another company, is recommended to the first company to buy. The relation between the company and all import_data records which it owns is stored in another table import_data_company_rel. This table includes all the import_data records which are uploaded and bought by a particular company.

CREATE TABLE import_data_company_rel 
(
    record_id uuid NOT NULL,    # Indexed
    company_id integer NOT NULL    # Indexed
)
PARTITION BY LIST (company_id);

Both tables are partitioned so that each company will be storing their data in their own partitions.

For example, the company with id 1 will be storing their data in import_data_1, import_data_company_rel_1 and the company with id 2 will be storing their data in import_data_2, import_data_company_rel_2 having the same schema of tables import_data and import_data_company_rel.

This is the SQL query that is used to fetch the recommended records for a particular company, let's say with an id of 1.

SELECT id,business_name,address
FROM import_data 
INNER JOIN import_data_company_rel idcr ON import_data.id = idcr.record_id
WHERE idcr.company_id != 1 
  AND import_data.phone NOT IN (SELECT import_data.phone 
                                FROM import_data 
                                INNER JOIN import_data_company_rel idcr
                                      ON import_data.id = idcr.record_id 
                                WHERE idcr.company_id = 1)
LIMIT 100 OFFSET 0;

This query fetches first-page recommendation results, with records of phones not included in the records of company 1.

Although this works properly for a small number of records, this query doesn't scale well with the number of records.

When the number of records is in millions, it doesn't even complete the execution even after an hour just to fetch 100 records.

I found that the part of the query that slows the execution is the subquery

SELECT import_data.phone 
FROM import_data 
INNER JOIN import_data_company_rel idcr
      ON import_data.id = idcr.record_id 
WHERE idcr.company_id = 1

which selects all the phone numbers already owned by the company with id 1, to be filtered from the recommendations. This query will be scanning the entire table(s) row by row.

I don't think partitioning the table has anything to do with the slow query as the query was slow even without partitioning.

explain (analyze, buffers, format text)

result can be seen at https://explain.depesz.com/s/MBUL. I'm working on postgesql v13.2

How to speed up the query? Is it even possible? Any help is appreciated. Thanks.


Solution

  • You can do two things to improve performance:

    1. Improve the speed of the index-only scan by

      VACUUM import_data_company_rel_1;
      

      This will get rid of the expensive heap fetches.

    2. Rewrite your query to use NOT EXISTS:

      import_data.phone NOT IN
          (SELECT import_data.phone
           FROM import_data
              INNER JOIN import_data_company_rel idcr
                 ON import_data.id = idcr.record_id
           WHERE idcr.company_id = 1)
      

      can be rewritten to

      NOT EXISTS
          (SELECT 1
           FROM import_data
              INNER JOIN import_data_company_rel idcr
                 ON import_data.id = idcr.record_id
           WHERE idcr.company_id = 1
             AND import_data.phone = import_data.phone)