Tags: database, performance, postgresql, indexing, query-optimization

Slow query with distinct/group by on varchar column with Postgres


I have a company table and an industry table, with a many-to-many relation table linking the two, named company_industry. The company table currently has approximately 750,000 rows.

[ER diagram: database schema]

Now I need a query that finds all of the unique city names for a given industry in which there is at least one company. So basically I have to find all companies that are associated with a given industry and select the unique city names of those companies.

I can write queries that do this just fine, but not with the performance I am looking for. I was skeptical about the performance from the start, because the city_name column is of type VARCHAR. Unfortunately, I currently do not have the liberty to change the database schema to something more normalized.

The first thing I did was add an index on the city_name column, and then I tried the queries below.

SELECT c.city_name AS city
FROM industry AS i 
INNER JOIN company_industry AS ci ON (ci.industry_id = i.id)
INNER JOIN company AS c ON (c.id = ci.company_id)
WHERE i.id = 288
GROUP BY city;

The above query takes around two seconds to execute on average. The same is the case when replacing GROUP BY with DISTINCT. Below is the execution plan of the above query.

HashAggregate  (cost=56934.21..56961.61 rows=2740 width=9) (actual time=2421.364..2421.921 rows=1962 loops=1)
  ->  Hash Join  (cost=38972.69..56902.50 rows=12687 width=9) (actual time=954.377..2411.194 rows=12401 loops=1)
        Hash Cond: (ci.company_id = c.id)
        ->  Nested Loop  (cost=0.28..13989.91 rows=12687 width=4) (actual time=0.041..203.442 rows=12401 loops=1)
              ->  Index Only Scan using industry_pkey on industry i  (cost=0.28..8.29 rows=1 width=4) (actual time=0.015..0.018 rows=1 loops=1)
                    Index Cond: (id = 288)
                    Heap Fetches: 0
              ->  Seq Scan on company_industry ci  (cost=0.00..13854.75 rows=12687 width=8) (actual time=0.020..199.087 rows=12401 loops=1)
                    Filter: (industry_id = 288)
                    Rows Removed by Filter: 806309
        ->  Hash  (cost=26036.52..26036.52 rows=744152 width=13) (actual time=954.113..954.113 rows=744152 loops=1)
              Buckets: 4096  Batches: 64  Memory Usage: 551kB
              ->  Seq Scan on company c  (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..554.662 rows=744152 loops=1)
Total runtime: 2422.185 ms

I tried to change the query to use a subquery as below, which made the query roughly twice as fast.

SELECT c.city_name
FROM company AS c
WHERE EXISTS(
  SELECT 1
  FROM company_industry
  WHERE industry_id = 288 AND company_id = c.id
)
GROUP BY c.city_name;

And the execution plan for this query:

HashAggregate  (cost=47108.71..47136.11 rows=2740 width=9) (actual time=1270.171..1270.798 rows=1962 loops=1)
  ->  Hash Semi Join  (cost=14015.50..47076.98 rows=12690 width=9) (actual time=194.548..1251.785 rows=12401 loops=1)
        Hash Cond: (c.id = company_industry.company_id)
        ->  Seq Scan on company c  (cost=0.00..26036.52 rows=744152 width=13) (actual time=0.008..537.856 rows=744152 loops=1)
        ->  Hash  (cost=13856.88..13856.88 rows=12690 width=4) (actual time=194.399..194.399 rows=12401 loops=1)
              Buckets: 2048  Batches: 1  Memory Usage: 436kB
              ->  Seq Scan on company_industry  (cost=0.00..13856.88 rows=12690 width=4) (actual time=0.012..187.449 rows=12401 loops=1)
                    Filter: (industry_id = 288)
                    Rows Removed by Filter: 806309
Total runtime: 1271.030 ms

That's better, but hopefully you guys can help me do better.

Basically, the expensive part of the query seems to be finding the unique city names (as expected), and even with an index on the column the performance is not quite good enough. I am quite rusty when it comes to analyzing execution plans, but I have included them so you can see exactly what is happening.

What can I do to retrieve this data faster?

I am using Postgres 9.3.5, DDL below:

CREATE TABLE company (
  id SERIAL PRIMARY KEY NOT NULL,
  name VARCHAR(150) NOT NULL,
  city_name VARCHAR(50)
);

CREATE TABLE industry (
  id SERIAL PRIMARY KEY NOT NULL,
  name VARCHAR(100) NOT NULL
);

CREATE TABLE company_industry (
  company_id INT NOT NULL REFERENCES company (id) ON UPDATE CASCADE,
  industry_id INT NOT NULL REFERENCES industry (id) ON UPDATE CASCADE,
  PRIMARY KEY (company_id, industry_id)
);

CREATE INDEX company_city_name_index ON company (city_name);

Solution

  • There is a Seq Scan on company_industry in both query plans that should really be a (bitmap) index scan. The same goes for the Seq Scan on company.

    This looks like an issue of missing indexes - or something is not right in your database. If something seems wrong, take a backup before you proceed. First check whether the planner cost settings and statistics are valid.
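    A minimal check might look like this (the settings listed are just the usual suspects, not an exhaustive list):

    SELECT name, setting
    FROM   pg_settings
    WHERE  name IN ('random_page_cost', 'effective_cache_size', 'work_mem');

    ANALYZE company;
    ANALYZE company_industry;

    ANALYZE refreshes the statistics the planner bases its row estimates on.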

    If the settings are good, I would then check the relevant indexes (as detailed below). Maybe REINDEX can fix it:

    REINDEX TABLE company;
    REINDEX TABLE company_industry;
    

    If that does not help, the usual general performance tuning may be needed (server configuration, memory settings, autovacuum).

    Also, you can simplify the query:

    SELECT c.city_name AS city
    FROM   company_industry ci
    JOIN   company          c ON c.id = ci.company_id
    WHERE  ci.industry_id = 288
    GROUP  BY 1;
    

    Notes

    If your PK constraint is on (company_id, industry_id), add another (unique) index on (industry_id, company_id) (reversed column order!). Why? A B-tree index can only be used efficiently for conditions on its leading column(s), so the existing PK index does not help a lookup on industry_id alone. See the sketch below.
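    A minimal sketch of that index (the index name is my own, not from the question):

    CREATE UNIQUE INDEX company_industry_reverse_idx
    ON     company_industry (industry_id, company_id);

    With it, the filter industry_id = 288 can be answered with a (bitmap) index scan instead of scanning and discarding ~800k rows.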

    The Seq Scan on company is equally bothersome. It seems like there is no index on company(id), but your ER diagram indicates a PK, so that should not be possible.
    The fastest option would be a multicolumn index on (id, city_name) - if (and only if) you get index-only scans out of it.
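    One possible shape for that index (the name is illustrative):

    CREATE INDEX company_id_city_name_idx ON company (id, city_name);

    Index-only scans additionally require a reasonably current visibility map, so make sure the table is vacuumed regularly.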

    Since you already have the id of the given industry, you don't need to include the industry table at all.

    No need for parentheses around the expression(s) in the ON clause.

    This is unfortunate:

    Unfortunately, I currently do not have the liberty to change the database schema to something more normalized.

    Your simple schema makes sense for small tables with little redundancy and hardly any strain on available cache memory. But city names are probably highly redundant in big tables. Normalizing would shrink table and index sizes considerably, which is the most important factor for performance.
    A de-normalized form with redundant storage can sometimes speed up targeted queries and sometimes not; it depends. But it always affects everything else adversely. Redundant storage eats more of your available cache, so other data has to be evicted sooner. Even if you gain something locally, you may lose overall.
    In this particular case it would also be considerably cheaper to get distinct values for a city_id int column, because integer values are smaller and faster to compare than (potentially long) strings. A multicolumn index on (id, city_id) in company would be smaller than the same for (id, city_name) and faster to process. One more join after folding many duplicates is comparatively cheap.
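    For illustration, a normalized variant could look like this (table and column names are hypothetical, and the migration is only sketched):

    CREATE TABLE city (
      id   SERIAL PRIMARY KEY,
      name VARCHAR(50) NOT NULL UNIQUE
    );

    ALTER TABLE company ADD COLUMN city_id INT REFERENCES city (id);

    -- backfill from the existing data, then drop the old column
    INSERT INTO city (name)
    SELECT DISTINCT city_name FROM company WHERE city_name IS NOT NULL;

    UPDATE company c
    SET    city_id = ct.id
    FROM   city ct
    WHERE  ct.name = c.city_name;

    ALTER TABLE company DROP COLUMN city_name;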

    If you need top performance, you can always add a MATERIALIZED VIEW for this special purpose with pre-computed results (readily aggregated and with an index on industry_id), but avoid massively redundant data in your prime tables. One possible shape is sketched below.
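    All names in this sketch are illustrative, not from the question:

    CREATE MATERIALIZED VIEW industry_city AS
    SELECT ci.industry_id, c.city_name
    FROM   company_industry ci
    JOIN   company          c ON c.id = ci.company_id
    GROUP  BY ci.industry_id, c.city_name;

    CREATE INDEX industry_city_industry_id_idx ON industry_city (industry_id);

    -- refresh after bulk changes to the underlying tables
    REFRESH MATERIALIZED VIEW industry_city;

    The original question then collapses to a plain index lookup:

    SELECT city_name FROM industry_city WHERE industry_id = 288;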