I am working on a system where any number of companies can upload Small and Medium Enterprise information into a database and search it.
The table import_data
stores SME information, uploaded by different companies.
The information uploaded by each company is identified by the column company_id
of the table import_data
.
CREATE TABLE import_data
(
id uuid NOT NULL, # Indexed
company_id integer NOT NULL, # Indexed
business_name character varying, # Indexed
no_of_rating double precision, # Indexed
no_of_reviews double precision, # Indexed
phone character varying, # Indexed
email character varying, # Indexed
address character varying, # Indexed
upserted_by integer, # Not indexed
upsert_date timestamp without time zone, # Not indexed
)
PARTITION BY LIST (company_id);
Now I am working on a recommendation system, where a record whose phone number is not present in the data uploaded by a particular company, but is present in data uploaded by another company, is recommended to the first company to buy.
The relation between the company and all import_data records which it owns is stored in another table import_data_company_rel
. This table includes all the import_data
records which are uploaded and bought by a particular company.
CREATE TABLE import_data_company_rel
(
record_id uuid NOT NULL, # Indexed
company_id integer NOT NULL # Indexed
)
PARTITION BY LIST (company_id);
Both tables are partitioned so that each company will be storing their data in their own partitions.
For example, the company with id 1 will be storing their data in import_data_1, import_data_company_rel_1 and the company with id 2 will be storing their data in import_data_2, import_data_company_rel_2 having the same schema of tables import_data and import_data_company_rel.
This is the SQL query that is used to fetch the recommended records for a particular company, let's say with an id of 1.
SELECT id,business_name,address
FROM import_data
INNER JOIN import_data_company_rel idcr ON import_data.id = idcr.record_id
WHERE idcr.company_id != 1
AND import_data.phone NOT IN (SELECT import_data.phone
FROM import_data
INNER JOIN import_data_company_rel idcr
ON import_data.id = idcr.record_id
WHERE idcr.company_id = 1)
LIMIT 100 OFFSET 0;
This query fetches first-page recommendation results, with records of phones not included in the records of company 1.
Although this works properly for a small number of records, this query doesn't scale well with the number of records.
When the number of records is in millions, it doesn't even complete the execution even after an hour just to fetch 100 records.
I found that the part of the query that slows the execution is the subquery
SELECT import_data.phone
FROM import_data
INNER JOIN import_data_company_rel idcr
ON import_data.id = idcr.record_id
WHERE idcr.company_id = 1
which selects all the phone numbers already owned by the company with id 1, to be filtered from the recommendations. This query will be scanning the entire table(s) row by row.
I don't think partitioning the table has anything to do with the slow query as the query was slow even without partitioning.
explain (analyze, buffers, format text)
result can be seen at https://explain.depesz.com/s/MBUL. I'm working on postgesql v13.2
How to speed up the query? Is it even possible? Any help is appreciated. Thanks.
You can do two things to improve performance:
Improve the speed of the index-only scan by
VACUUM import_data_company_rel_1;
This will get rid of the expensive heap fetches.
Rewrite your query to use NOT EXISTS
:
import_data.phone NOT IN
(SELECT import_data.phone
FROM import_data
INNER JOIN import_data_company_rel idcr
ON import_data.id = idcr.record_id
WHERE idcr.company_id = 1)
can be rewritten to
NOT EXISTS
(SELECT 1
FROM import_data
INNER JOIN import_data_company_rel idcr
ON import_data.id = idcr.record_id
WHERE idcr.company_id = 1
AND import_data.phone = import_data.phone)